Great new features are now available in Power BI. With the December 2020 update, all of the features described in this blog post become available.
Once you have downloaded the latest update (as of writing, this is the December 2020 update), open Power BI Desktop, head to Options and Settings -> Options, and go to Preview features.
Make sure Anomaly detection and Small multiples are checked, then restart Power BI Desktop.
Small multiples
Small multiples is a layout of small charts, one per value of a grouping variable, aligned side by side and sharing common scales, so that all the values of the grouping (or categorical) variable fit on multiple smaller graphs. An analyst should be able to see the differences between the groups (e.g. city, colour, type, …) at a glance.
In Python, we know this as trellis plot or FacetGrid (seaborn) or simply subplots (Matplotlib).
In R, this is usually referred to as facets (ggplot2).
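For a concrete (if hypothetical) illustration, here is a minimal base-R sketch of the small-multiples idea: one panel per grouping value, laid out in a grid, with a shared y scale across panels. The data here is simulated.

```r
# base-R sketch of "small multiples": one panel per grouping value,
# side by side, sharing a common y scale (hypothetical data)
set.seed(1)
df <- data.frame(
  city  = rep(LETTERS[1:4], each = 25),
  time  = rep(1:25, times = 4),
  value = c(replicate(4, cumsum(rnorm(25))))
)

pdf(NULL)                      # null graphics device, so this runs headless
par(mfrow = c(2, 2))           # 2 x 2 grid of panels
ylim <- range(df$value)        # shared scale across all panels
for (g in unique(df$city)) {
  sub <- df[df$city == g, ]
  plot(sub$time, sub$value, type = "l", ylim = ylim,
       main = paste("city", g), xlab = "time", ylab = "value")
}
invisible(dev.off())
```

In ggplot2 the same layout is a one-liner with facet_wrap(~ city); the base-R version makes the mechanics (grid of panels, shared scale) explicit.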
Small multiples are customizable: you can set the number of graphs per column and row, and the chart type can be line, bar, or area. I am positive that more visuals will be supported in following updates (the feature is still in preview as of writing this post).
Anomaly detection
Anomaly detection is the process of finding values that deviate from a normal (base) line; it can also be described as outlier detection. Anomaly detection currently supports finding anomalies in time-series data, and can also provide explanations via root cause analysis. This is part of adding more out-of-the-box analytics to Power BI (like Key influencers, the decomposition tree, and many additional R- or Python-based visuals available in the marketplace).
You can also fine-tune the sensitivity and get explanation of the anomalies by adding a variable for the anomaly to be explained by.
Visual Zoom Slider
The visual zoom slider was introduced in the November 2020 Power BI update and gives you the option to zoom in on a dataset without having to use or change filters. The zoom slider lets you examine a smaller portion of the chart and get a detailed look at the data.
This is a step in the right direction for analysts, and a great response to the visualisation capabilities that R or Python libraries such as Plotly bring. Plotly gives you the capability to zoom in on a smaller range of data, to examine or understand a denser region of the graph.
Here is an R example using plotly:
########################
### Creating random data
########################
set.seed(2908)
df <- data.frame(time = seq(1, 1000, by = 1),
                 value = sample(1:34, size = 1000, replace = TRUE),
                 valueA = sample(1:100, size = 1000, replace = TRUE),
                 valueB = sample(1:100, size = 1000, replace = TRUE),
                 city = sample(LETTERS[1:4], size = 1000, replace = TRUE),
                 dist = runif(1000),
                 Date = sample(seq(as.Date('2016/01/01'), as.Date('2021/01/01'), by = "day"), 1000))

########################
### ggplot scatter plot
### Plotly
########################
library(plotly)
library(ggplot2)

p <- ggplot(df, aes(valueA, valueB)) + geom_point()
fig <- ggplotly(p)
fig
This generates a chart that looks the same as in Power BI, with a toolbar in the upper-right corner that lets the data engineer zoom and pan over a smaller range of data.
These are great new features available for you in Power BI, making data engineering and data science even easier.
As always, the sample, code and PBIX file is available on Github.
by Hong Ooi
Last week, I announced AzureCosmosR, an R interface to Azure Cosmos DB, a fully-managed NoSQL database service in Azure. This post gives a short rundown on the main features of AzureCosmosR.
Explaining what Azure Cosmos DB is can be tricky, so here’s an excerpt from the official description:
Azure Cosmos DB is a fully managed NoSQL database for modern app development. Single-digit millisecond response times, and automatic and instant scalability, guarantee speed at any scale. Business continuity is assured with SLA-backed availability and enterprise-grade security. App development is faster and more productive thanks to turnkey multi region data distribution anywhere in the world, open source APIs and SDKs for popular languages. As a fully managed service, Azure Cosmos DB takes database administration off your hands with automatic management, updates and patching. It also handles capacity management with cost-effective serverless and automatic scaling options that respond to application needs to match capacity with demand.
Among other features, Azure Cosmos DB is notable in that it supports multiple data models and APIs. When you create a new Cosmos DB account, you specify which API you want to use: SQL/core API, which lets you use a dialect of T-SQL to query and manage tables and documents; MongoDB; Azure table storage; Cassandra; or Gremlin (graph). AzureCosmosR provides a comprehensive interface to the SQL API, as well as bridges to the MongoDB and table storage APIs. On the Resource Manager side, AzureCosmosR extends the AzureRMR class framework to allow creating and managing Cosmos DB accounts.
AzureCosmosR is now available on CRAN. You can also install the development version from GitHub, with devtools::install_github("Azure/AzureCosmosR").
The meat of AzureCosmosR is a suite of methods to work with databases, containers (tables) and documents (rows) using the SQL API.
library(dplyr)
library(AzureCosmosR)

# endpoint object for this account
endp <- cosmos_endpoint(
    "https://myaccount.documents.azure.com:443/",
    key="mykey"
)

# all databases in this account
list_cosmos_databases(endp)

# a specific database
db <- get_cosmos_database(endp, "mydatabase")

# create a new container and upload the Star Wars dataset from dplyr
cont <- create_cosmos_container(db, "mycontainer", partition_key="sex")
bulk_import(cont, starwars)

query_documents(cont, "select * from mycontainer")

# an array select: all characters who appear in ANH
query_documents(cont,
    "select c.name from mycontainer c where array_contains(c.films, 'A New Hope')")
You can easily create and execute JavaScript stored procedures and user-defined functions:
proc <- create_stored_procedure(
    cont,
    "helloworld",
    'function () {
        var context = getContext();
        var response = context.getResponse();
        response.setBody("Hello, World");
    }'
)
exec_stored_procedure(proc)

create_udf(cont, "times2", "function(x) { return 2*x; }")
query_documents(cont, "select udf.times2(c.height) from cont c")
Aggregates take some extra work, as the Cosmos DB REST API currently has only limited support for cross-partition queries. Set by_pkrange=TRUE in the query_documents call, which will run the query on each partition key range (physical partition) and return a list of data frames. You can then process the list to obtain an overall result.
# average height by sex, by pkrange
df_lst <- query_documents(cont,
    "select c.gender, count(1) n, avg(c.height) height
        from mycontainer c
        group by c.gender",
    by_pkrange=TRUE
)

# combine pkrange results
df_lst %>%
    bind_rows(.id="pkrange") %>%
    group_by(gender) %>%
    summarise(height=weighted.mean(height, n))
Full support for cross-partition queries, including aggregates, may come in a future version of AzureCosmosR.
You can query data in a MongoDB-enabled Cosmos DB instance using the mongolite package. AzureCosmosR provides a simple bridge to facilitate this.
endp <- cosmos_mongo_endpoint(
    "https://myaccount.mongo.cosmos.azure.com:443/",
    key="mykey"
)

# a mongolite::mongo object
conn <- cosmos_mongo_connection(endp, "mycollection", "mydatabase")
conn$find("{}")
For more information on working with MongoDB, see the mongolite documentation.
You can work with data in a table storage-enabled Cosmos DB instance using the AzureTableStor package.
endp <- AzureTableStor::table_endpoint(
    "https://myaccount.table.cosmos.azure.com:443/",
    key="mykey"
)
tab <- AzureTableStor::storage_table(endp, "mytable")
AzureTableStor::list_table_entities(tab, filter="firstname eq 'Satya'")
A good introduction to Azure Cosmos DB can be found here, or you can browse the official documentation. If you have any questions or feedback about the AzureCosmosR package, you can open an issue or email me at hongooi73 (@) gmail.com.
Teaching basic data science, machine learning, and statistics is great due to the questions. Students ask brilliant questions, as they see what holes are present in your presentation and scaffolding. The students are not yet conditioned to ask only what you feel is easy to answer or present. They ask what is needed to further their understanding given the tools that have been already established.
This is one reason recordings can rarely match the experience of a good live course.
Such questions can take a bit of work to answer, another sign that they are valuable questions.
One data science question I am commonly asked is:
What is a good test set size?
Let’s take the time to give this question some serious consideration. I am going to address this question in a way I cannot allow myself in class: by bringing in new tools that haven’t been previously established.
For concreteness, let’s suppose we have a supervised classification problem. We could also work similar arguments for regression.
We have, for each index i (called “rows” or “instances”), a numeric vector x(i) of explanatory variables (values we think we will have access to in the future), and a dependent variable y(i) that takes on the values TRUE and FALSE. Our goal is to learn from our training data a function f(), ideally such that f(x(i)) is usually near 1 when y(i) is TRUE and f(x(i)) is usually near 0 when y(i) is FALSE. The idea is: in the future we may have x(i) and not y(i), so such a prediction procedure can be very valuable. For a general methodology on how to convert arbitrary explanatory variables to numeric ones, please see the vtreat package (or alternately vtreat for Python).
Why we want f() to be a score, instead of a decision rule, can be found in our video “Against Accuracy”.
In this note we will answer “what is a good test set size?” three ways.
Each of these answers is a bit different, as they are solved in slightly different assumed contexts and optimizing different objectives. Knowing all 3 solutions gives us some perspective on the problem.
First let’s remind ourselves why we need test data. Estimating the performance of a model on the same data used to fit the model, or even merely choose the model, is unreliable, unsafe, and biased. Some clear notes on this can be found here in our note “The Nature of Overfitting”.
We need a reliable way to measure the performance of our fit function or model. Reserving some of the data for hold-out testing is one way to get a reliable future performance estimate.
My usual answer to the “what is a good test set size?” question is:
Use about 80 percent of your data for training, and about 20 percent of your data for test.
This is pretty standard advice. It works under the rubric that model fitting, or training, is the harder task, so it should get most of the data. We say anything you can do with all the data, you have a good chance of accomplishing with 80 percent of the data. And we use the remaining 20 percent for hold-out testing.
From a practical point of view, we are pretty much done. However, if we are already comfortable with some of the math, we can go on and analyze the data science process itself.
Test data is used to estimate the future performance of a model on new data that was not known or seen during training. This sort of experiment can be characterized as a decision problem. In this section I want to convince you that from an acceptance test point of view it makes sense to talk of test set size in terms of row-count, and not in terms of what fraction of the available data it is. Once we have enough test rows, it doesn’t matter how many more rows we could have used.
Suppose we have fit our model f(), and we have chosen what is called a “loss” or criticism function. In our case this is a function we can evaluate per-row and is of the form loss(prediction, truth), where prediction is the number predicted, and truth is the TRUE or FALSE we are trying to match. For a binary (two-outcome) classification problem we strongly recommend using “deviance” or “cross entropy” as the loss function.
Let’s take as our specific loss the following collared or Winsorized version of the deviance:
loss(prediction, truth) := -2 * (truth * log2(max(mu, prediction)) + (1-truth) * log2(max(mu, 1-prediction)))
where mu is a small positive constant, say 1/100. mu essentially says: don’t allow the model to make predictions outside of the range [mu, 1-mu], which can be good advice (essentially Cromwell’s rule).
An important feature of the above loss is: it is always in the range [0, -2 * log2(mu)]. For our assumed mu = 1/100, this means the per-row loss is always in the range [0, 14]. The bounded range is the only fact about the loss function we will use in this section!
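As a small sketch, the collared deviance above can be written directly in R (the function name is mine; mu and the base-2 log are as in the text):

```r
# collared / Winsorized per-row deviance, as defined above;
# mu keeps the logs bounded, so the loss always lies in [0, -2 * log2(mu)]
loss <- function(prediction, truth, mu = 1/100) {
  -2 * (truth * log2(pmax(mu, prediction)) +
        (1 - truth) * log2(pmax(mu, 1 - prediction)))
}

loss(0.5, 1)  # a maximally uncertain prediction costs 2 bits
loss(0, 1)    # worst case: -2 * log2(1/100), about 13.3
```

Note the exact worst-case per-row loss is -2 * log2(1/100), roughly 13.3, which the text rounds up to 14.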
We want mean(loss(prediction(i), truth(i))) evaluated on our test set to be very near the value we would see for E[loss(prediction, truth)], with E[] being the expected value over the assumed population all samples are drawn from. That is, we want a reliable estimate of future performance.
To achieve this, we need a sample size that makes the above correspondence true with very high probability. Such a size can be read off Hoeffding’s inequality.
For our application we can take Hoeffding’s inequality to be the following.
For t > 0, the mean of k > 0 identically distributed independent measurements that are all numbers in the range [0, U] is such that P[|observed_mean - expected_value| > t] < 2 exp(-2 k t^2 / U^2).
So if we want only an epsilon chance of our test set being more than t units off in estimating future expected model performance, then it is enough to pick k such that 2 exp(-2 k t^2 / U^2) <= epsilon.
Or, a test set size of k >= ln(2/epsilon) U^2 / (2 t^2) is large enough. Calculations using distributional details of the loss function will yield tighter bounds.
We can plug in some example values:

- U = 14, as above, from picking mu = 1/100.
- epsilon = 0.05, or a 5 percent chance of being further than t off in our performance estimate.
- t = 1/10, wanting our estimate to be likely to be within 0.1 deviance units of the unknown true value.

This gives us that any k >= ln(2/0.05) * 14^2 / (2 (1/10)^2) is a good test set size. Or, if we use k >= 36152 test rows, then with probability at least 95 percent we have an estimate that is within 1/10th of a deviance unit of the unknown true future performance of our classifier, independent of training set size and number of considered variables.
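The bound is easy to turn into a small R helper (a sketch; the function name is mine, not from any package):

```r
# minimum test-set rows so that, with probability at least 1 - epsilon,
# the observed mean loss is within t of its expectation (Hoeffding bound)
test_set_size <- function(U, t, epsilon) {
  ceiling(log(2 / epsilon) * U^2 / (2 * t^2))
}

test_set_size(U = 14, t = 1/10, epsilon = 0.05)  # 36152, as in the text
```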
Notice k is a count depending only on the expected range of the loss (U), how large an error we wish to be unlikely (t), and what our definition of unlikely is (epsilon). k is independent of the available data size and of the complexity of the model fitting procedure. So in this case, k is not a fixed fraction of the available data (such as “20 percent”).
There is an alternate answer to “what is a good test set size?” that is a fraction of the available data, and doesn’t require specifying a desired “rare error” size. This is a bit less standard than the two earlier variations, but it is quite interesting. It is something new that I would like to develop here.
What we are going to do is: split the data so we are improving our training and test reliabilities at roughly the same rate. In this scheme we assume there is no point making a good evaluation of a bad model, and no point building a good model if we can’t evaluate it.
We will assume that we can write the unreliability of our joint model fitting and model testing (or evaluation) procedure as an additive process. It may not truly be so, but let’s estimate it that way.
We will try a path-oriented criticism, building up the overall criticism as moving from one concern to another. Our overall criticism is the sum of: how much our test set differs from testing on the entire population, plus how much our training procedure differs from training on the entire population, plus how much an ideal model differs from the data.
We are going to assume our overall model loss is of the form:
E[overall_criticism^2] ~ c_test / m_test + c_train / m_train + c_0
Where m_train is the number of rows or examples used to fit the model, m_test is the number of rows used to test or evaluate the model, and c_train, c_test, c_0 are unknowns to be solved for, or estimated. This form is motivated in the appendix. The relation may, at first, appear odd. It is in fact merely saying we anticipate the loss squared behaves like a sample variance: the expected square size shrinks linearly in sample size.
If we accept the form E[overall_criticism^2] ~ c_test / m_test + c_train / m_train + c_0, we can then solve for the optimal partition into test and training sets, with respect to this formulation, assuming we have estimates for c_test and c_train. Knowing the structure of the solution will be very useful in forming new intuition.
To solve the constrained optimization problem we set up a Lagrangian function for our problem. In our case this is:
L := sqrt(c_test / m_test + c_train / m_train + c_0) + lambda ( m_train + m_test - m )
This is standard. The first term is the function we are trying to optimize, the second term is a constraint we expect to be zero (that m_train + m_test = m, the total number of available rows), and lambda is a Lagrange multiplier to be solved for. We inspect stationary points of L, places where the derivatives are zero, for possible optimal solutions.
d L / d m_train = (1/2) L^(-1/2) * (-1/2 c_train / m_train^2) + lambda = 0
d L / d m_test  = (1/2) L^(-1/2) * (-1/2 c_test / m_test^2) + lambda = 0
So we expect:
m_test  = Z c_test^0.5
m_train = Z c_train^0.5
m_train + m_test = m
Where Z = 1/(2 * L^(1/4) * lambda^(1/2)). Or, after some algebra to eliminate Z:
m_test  = m * (c_test)^0.5  / ( (c_test)^0.5 + (c_train)^0.5 )
m_train = m * (c_train)^0.5 / ( (c_test)^0.5 + (c_train)^0.5 )
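A small helper (a sketch, assuming we already have estimates of c_train and c_test; the function name is mine) codes the split formulas directly:

```r
# optimal train/test split fractions from the stationary-point solution above:
# each fraction is proportional to the square root of its difficulty constant
split_fractions <- function(c_train, c_test) {
  s <- sqrt(c_train) + sqrt(c_test)
  c(fraction_test  = sqrt(c_test) / s,
    fraction_train = sqrt(c_train) / s)
}

split_fractions(c_train = 79, c_test = 1.6)  # about 12.5% test, 87.5% train
```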
If we had estimates of c_train and c_test, they would determine the fraction of the data to devote to testing. For example, we have a synthetic example where we estimated c_train and c_test as the following.
c_test  = 1.6
c_train = 79
These estimated coefficients indicate how much harder training is than testing for this model family on the example data set. Using these estimates we have:
soln <- list(
  fraction_test  = (c_test)^0.5  / ( (c_test)^0.5 + (c_train)^0.5 ),
  fraction_train = (c_train)^0.5 / ( (c_test)^0.5 + (c_train)^0.5 ))
soln
## $fraction_test
## [1] 0.1245837
##
## $fraction_train
## [1] 0.8754163
Or, for this specific data process and model family, we should put about 12.5 percent of the data into the test set.
The summary is: for this formulation, c_train and c_test approximately quantify the relative difficulty or sensitivity of training versus testing for a given data population and model family. Knowing the relative difficulty of training versus evaluation lets us pick an optimal fraction of data to use in training versus test. Obvious factors that can determine the difficulty of fitting include model structure, number of variables, and relations between variables.
In this formulation, we improve both training reliability and test reliability simultaneously, instead of optimizing one and using the rest of the data for the other.
We have worked out in detail how to pick a supervised machine learning test set size. We covered the standard 80/20 heuristic, an absolute test-set row count derived from Hoeffding’s inequality, and an optimal train/test split fraction derived from a variational argument.
And that is just a few of the interesting ideas one can develop from the question “what is a good test set size?”
We can attempt to derive or motivate our variational loss model:
E[overall_criticism^2] ~ c_test / m_test + c_train / m_train + c_0
However, if one is willing to assume such a form, one can skip this derivation. In fact this entire appendix is just a very long-winded way of saying the loss squared should behave like a sample variance: the expected size should shrink linearly in sample size.
This appendix is exactly what I promise my students will not be presented in our courses: derivations that attempt to stand up core concepts they care about in terms of background concepts they may not care about! For our courses we are much stricter about assuming a small set of core concepts, and using only those. In this case, we would demonstrate the above relation empirically and move on.
However, as part of our course preparation work, we are careful to have derivations. It cuts down on our mistakes.
Let’s define:

- fit model test loss: the per-row loss of the fit model, evaluated on the test set.
- fit model ideal loss: the expected loss of the fit model over the assumed population.
- ideal model ideal loss: the expected loss of the best possible model over the assumed population.
- test_criticism: mean(fit model test loss) - (fit model ideal loss).
- train_criticism: (fit model ideal loss) - (ideal model ideal loss).
With these terms we can choose our overall loss or criticism as follows. It is an attempt to separate loss contributions into components driven by the test data set size, the training data set size, and the family of models we are using.
We factor the expected value of our criticism as follows.
E[overall_criticism^2]
  ( our assumption that we can expand as a sum of squares )
  := E[(test_criticism)^2 + (train_criticism)^2 + (ideal model ideal loss)^2]
  ( substituting in more precise definitions )
  := E[(mean(fit model test loss) - (fit model ideal loss))^2]
     + E[((fit model ideal loss) - (ideal model ideal loss))^2]
     + (ideal model ideal loss)^2
  ( replacing each expectation by an ideal form )
  ~ c_test / m_test + c_train / m_train + c_0
Each of the above steps is something we are assuming, not something that is externally justified.
The asserted E[(mean(fit model test loss) - (fit model ideal loss))^2] ~ c_test / m_test identification we used above is derived as follows. Take ifml as equal to the fit model ideal loss, so that E[mean(fit model test loss) - ifml] = 0.
E[(mean(fit model test loss) - (fit model ideal loss))^2]
  ( by definition )
  = E[( (1 / m_test) sum_{i = 1 ... m_test} loss(prediction(x(i)), truth(i)) - ifml )^2]
  ( expanding the square of the sum, and pushing E[] deeper into the expression,
    by linearity of expectation )
  = (1 / m_test)^2 sum_{i, j = 1 ... m_test}
      E[(loss(prediction(x(i)), truth(i)) - ifml) * (loss(prediction(x(j)), truth(j)) - ifml)]
  ( applying independence to factor the i != j terms )
  = (1 / m_test)^2 sum_{i, j = 1 ... m_test, i != j}
      E[loss(prediction(x(i)), truth(i)) - ifml] * E[loss(prediction(x(j)), truth(j)) - ifml]
    + (1 / m_test)^2 sum_{i = 1 ... m_test} E[(loss(prediction(x(i)), truth(i)) - ifml)^2]
  ( applying E[mean(fit model test loss) - ifml] = 0, which zeroes the i != j terms )
  = (1 / m_test)^2 sum_{i = 1 ... m_test} E[(loss(prediction(x(i)), truth(i)) - ifml)^2]
  ( removing the index to simplify notation )
  = (1 / m_test)^2 m_test E[(loss(prediction, truth) - ifml)^2]
  ( simplifying )
  = (1 / m_test) E[(loss(prediction, truth) - ifml)^2]
Notice the final step is a term that shrinks at a rate of 1/m_test.
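We can check this 1/m_test shrinkage empirically with a quick simulation (simulated per-row losses drawn from an arbitrary distribution, not a real model; all names here are mine):

```r
# empirical check: the variance of a test-set mean shrinks like 1/m_test
set.seed(123)
per_row_var <- 4   # variance of one simulated per-row loss

var_of_mean <- function(m_test, reps = 20000) {
  # draw `reps` test sets of size m_test and look at the spread of their means
  var(replicate(reps, mean(rnorm(m_test, mean = 7, sd = sqrt(per_row_var)))))
}

var_of_mean(50)    # close to 4 / 50  = 0.08
var_of_mean(200)   # close to 4 / 200 = 0.02
```

Quadrupling the test set size cuts the variance of the estimated mean loss by a factor of four, as the derivation predicts.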
We are assuming, with less overt justification, that the training error term has a similar variance-like structure. The idea is: in model fitting a number of things are estimated, and the quality of these estimates depends on the size of the training set. For many specific models (linear models, decision trees, bagged decision trees, and more) we can derive the matching c_train / m_train form using PAC-style or bounded VC-dimension-style arguments. So the form is, perhaps, plausible.
Diagnostic plots are always a nice, visually explicit way to assess our data and predictions, and I’ve just added a new one to the modEvA R package. The ‘predPlot‘ function plots predicted values separated into observed presences and absences, and coloured according to whether they’re above or below a given prediction threshold (the default threshold is species prevalence, i.e. the proportion of presences in the observed data). It shows whether there’s a visible split in the predicted values for presences and absences (see also functions predDensity and plotGLM in the same package, for alternative/complementary ways of visualizing this). The plot imitates (with permission from the author) one of the graphical outputs of the ‘summary‘ of models built with the ‘embarcadero‘ package (Carlson, 2020), but it can be applied to any ‘glm‘ object or any set of observed and predicted values, and it allows specifying a user-defined prediction threshold. The ‘predPlot‘ function is now included in package modEvA (version >= 2.1, currently available on R-Forge).
predPlot <- function(model = NULL, obs = NULL, pred = NULL, thresh = "preval",
                     main = "Classified predicted values",
                     legend.pos = "bottomright") {
  # version 1.0 (20 Jan 2021)

  if (!is.null(model)) {
    if (!("glm" %in% class(model)) || family(model)$family != "binomial")
      stop("'model' must be of class 'glm' and family 'binomial'.")
    if (!is.null(obs)) message("Argument 'obs' ignored in favour of 'model'.")
    if (!is.null(pred)) message("Argument 'pred' ignored in favour of 'model'.")
    obs <- model$y
    pred <- model$fitted.values
  } else {
    if (is.null(obs) || is.null(pred))
      stop("You must provide either 'model' or a combination of 'obs' and 'pred'.")
    if (!is.numeric(obs) || !is.numeric(pred))
      stop("'obs' and 'pred' must be numeric.")
    if (length(obs) != length(pred))
      stop("'obs' and 'pred' must have the same length.")
  }

  if (!(thresh == "preval" || (is.numeric(thresh) && thresh >= 0 && thresh <= 1)))
    stop("'thresh' must be either 'preval' or a numeric value between 0 and 1.")
  if (thresh == "preval") thresh <- prevalence(obs)

  pred0 <- pred[obs == 0]
  pred1 <- pred[obs == 1]

  opar <- par(no.readonly = TRUE)
  on.exit(par(opar))
  par(mar = c(5, 5.2, 3, 1))

  plot(x = c(0, 1), y = c(-0.5, 1.5), xlab = "Predicted value", type = "n",
       ylab = "", yaxt = "n", main = main)
  axis(side = 2, at = c(0, 1), tick = FALSE,
       labels = c("Observed\nabsences", "Observed\npresences"), las = 1)
  abline(v = thresh, lty = 2)
  points(x = pred0, y = sapply(rep(0, length(pred0)), jitter, 10),
         col = ifelse(pred0 < thresh, "grey", "black"))
  points(x = pred1, y = sapply(rep(1, length(pred1)), jitter, 10),
         col = ifelse(pred1 < thresh, "grey", "black"))
  if (!is.na(legend.pos) && legend.pos != "n")
    legend(legend.pos, legend = c("Predicted presence", "Predicted absence"),
           pch = 1, col = c("black", "grey"))
}
Usage examples:
library(modEvA)

# load sample models:
data(rotif.mods)

# choose a particular model to play with:
mod <- rotif.mods$models[[1]]

# make some predPlots:
predPlot(model = mod)
predPlot(model = mod, thresh = 0.5)

# you can also use 'predPlot' with vectors of observed and predicted values
# instead of a model object:
myobs <- mod$y
mypred <- mod$fitted.values
predPlot(obs = myobs, pred = mypred)
References
Carlson C.J. (2020) embarcadero: Species distribution modelling with Bayesian additive regression trees in R. Methods in Ecology and Evolution, 11: 850-858.
The Mendoza Line is a term from baseball. Named after Mario Mendoza, it refers to the threshold of incompetent hitting. It is frequently taken to be a batting average of .200, although all the sources I looked at made sure to note that Mendoza’s career average was actually a little better: .215.
This post explores a few questions related to the Mendoza line:

- Was Mario Mendoza really so bad as to warrant the expression being named after him?
- How many players fell below the Mendoza line each year?
- What might the Mendoza line look like if we allowed it to change dynamically?
All code for this post can be found here.
(Caveat: I don’t know baseball well, so some of the assumptions or conclusions I make below may not be good ones. If I made a mistake, let me know!)
The data
The Lahman package on CRAN contains all the baseball statistics from 1871 to 2019. We’ll use the Batting data frame for statistics and the People data frame for player names.
library(Lahman)
library(tidyverse)

data(Batting)
data(People)

# add AVG to Batting
Batting$AVG <- with(Batting, H / AB)
First, let’s look for Mario Mendoza and verify that his batting average is indeed .215:
# find Mario Mendoza in People
People %>% filter(nameFirst == "Mario" & nameLast == "Mendoza")
# his ID is mendoma01

Batting %>%
    filter(playerID == "mendoma01") %>%
    summarize(career_avg = sum(H) / sum(AB))
#   career_avg
# 1  0.2146597
Was Mario Mendoza really so bad as to warrant the expression being named after him?
Let’s compute the career batting averages for the players and limit our dataset to just the players with at least 1000 at bats in their career:
# Batting average for players with >= 1000 AB
avg_df <- Batting %>%
    group_by(playerID) %>%
    summarize(tot_AB = sum(AB),
              career_avg = sum(H) / sum(AB)) %>%
    filter(tot_AB >= 1000) %>%
    left_join(People, by = "playerID") %>%
    select(playerID, tot_AB, career_avg, nameFirst, nameLast) %>%
    arrange(desc(career_avg))
Let’s look at the top 10 players by batting average: we should see some famous names there! (If not, maybe 1000 ABs is not stringent enough of a criterion to rule out small sample size?)
# top 10
head(avg_df, n = 10)
# # A tibble: 10 x 5
#    playerID  tot_AB career_avg nameFirst     nameLast
#  1 cobbty01   11436      0.366 Ty            Cobb
#  2 barnero01   2391      0.360 Ross          Barnes
#  3 hornsro01   8173      0.358 Rogers        Hornsby
#  4 jacksjo01   4981      0.356 Shoeless Joe  Jackson
#  5 meyerle01   1443      0.356 Levi          Meyerle
#  6 odoulle01   3264      0.349 Lefty         O'Doul
#  7 delahed01   7510      0.346 Ed            Delahanty
#  8 mcveyca01   2513      0.346 Cal           McVey
#  9 speaktr01  10195      0.345 Tris          Speaker
# 10 hamilbi01   6283      0.344 Billy         Hamilton
Next, let’s look at the bottom 10 players by batting average:
# bottom 10
tail(avg_df, n = 10)
# # A tibble: 10 x 5
#    playerID  tot_AB career_avg nameFirst nameLast
#  1 seaveto01   1315      0.154 Tom       Seaver
#  2 donahre01   1150      0.152 Red       Donahue
#  3 fellebo01   1282      0.151 Bob       Feller
#  4 grovele01   1369      0.148 Lefty     Grove
#  5 suttodo01   1354      0.144 Don       Sutton
#  6 amesre01    1014      0.141 Red       Ames
#  7 faberre01   1269      0.134 Red       Faber
#  8 perryga01   1076      0.131 Gaylord   Perry
#  9 pappami01   1073      0.123 Milt      Pappas
# 10 frienbo01   1137      0.121 Bob       Friend
Those numbers look quite a lot smaller than Mendoza’s! Notice also that all of them have ABs just over 1000, my threshold for this dataset. Maybe 1000 ABs is too loose of a condition… But Mendoza only had 1337 ABs, so if we make the condition more stringent (e.g. considering only players with >= 2000 ABs), it’s not fair to pick on him…
Among players with >= 1000 ABs, how poor was Mendoza’s performance?
# How far down was Mario Mendoza?
which(avg_df$playerID == "mendoma01") / nrow(avg_df)
# [1] 0.9630212
He’s roughly in the bottom 5 percent of all players by batting average.
How many players fell below the Mendoza line each year?
For this question, in each season we only consider players who had at least 100 ABs that season.
# Look at player-seasons with at least 100 ABs
batting_df <- Batting %>% filter(AB >= 100)
The nice thing about the tidyverse is that we can answer the question above with a series of pipes ending in a plot:
batting_df %>%
    group_by(yearID) %>%
    summarize(below200 = mean(AVG < 0.200)) %>%
    ggplot(aes(yearID, below200)) +
    geom_line() +
    labs(title = "Proportion of players below Mendoza line by year",
         x = "Year", y = "Prop. below .200") +
    theme_bw()
There is a fair amount of fluctuation, with the proportion of players under the Mendoza line going as high as 24% and as low as 1%. If I had to guess, for the last 50 years or so the proportion seems to fluctuate around 5%.
What might the Mendoza line look like if we allowed it to change dynamically?
Instead of defining the Mendoza line as having a batting average below .200, what happens if we define the Mendoza line for a particular season as the batting average of the player at the 5th quantile?
We can answer this easily by summarizing the data using the quantile function (dplyr v1.0.0 makes this easy):
batting_df %>%
    group_by(yearID) %>%
    summarize(bottom5 = quantile(AVG, 0.05)) %>%
    ggplot(aes(yearID, bottom5)) +
    geom_line() +
    geom_hline(yintercept = c(0.2), color = "red", linetype = "dashed") +
    labs(title = "Batting average of 5th quantile by year",
         x = "Year", y = "5th quantile batting average") +
    theme_bw()
That looks like a lot of fluctuation, but if you look closely at the y-axis, you’ll see that the values hover between 0.14 and 0.24. Here is the same line graph but with zero included on the y-axis and a loess smoothing curve:
batting_df %>%
    group_by(yearID) %>%
    summarize(bottom5 = quantile(AVG, 0.05)) %>%
    ggplot(aes(yearID, bottom5)) +
    geom_line() +
    geom_smooth(se = FALSE) +
    geom_hline(yintercept = c(0.2), color = "red", linetype = "dashed") +
    scale_y_continuous(limits = c(0, 0.24)) +
    labs(title = "Batting average of 5th quantile by year",
         x = "Year", y = "5th quantile batting average") +
    theme_bw()
I’m not sure how much to trust that smoother, but it comes awfully close to the Mendoza line!
For completeness, here are the lines representing the batting averages for players at various quantiles over time:
batting_df %>%
  group_by(yearID) %>%
  summarize(AVG = quantile(AVG, c(0.05, 1:4 / 5, 0.95)),
            quantile = c(0.05, 1:4 / 5, 0.95)) %>%
  mutate(quantile = factor(quantile, levels = c(0.95, 4:1 / 5, 0.05))) %>%
  ggplot(aes(x = yearID, y = AVG, col = quantile)) +
  geom_line() +
  geom_hline(yintercept = 0.2, color = "red", linetype = "dashed") +
  labs(title = "Batting average for various quantiles by year",
       x = "Year", y = "Quantile batting average") +
  theme_bw()
At a glance the higher quantile lines look just like vertical translations of the 5th quantile line, suggesting that across years the entire distribution of batting averages shifts up or down (not just parts of the distribution).
I’ve really been enjoying the weekly data sets that the R4DS community has been putting out every Tuesday as part of the Tidy Tuesday project. In my previous blog I had a blast exploring the “Coffee Ratings” data set and putting out a YouTube video to share the insights that I found from the data.
In this blog we’re going to explore a more recent data set originally from the Tate Art Museum.
The readme.md file says explicitly:
The dataset in this repository was last updated in October 2014. Tate has no plans to resume updating this repository, but we are keeping it available for the time being in case this snapshot of the Tate collection is a useful tool for researchers and developers.
So while the curation of new data has ended, it is nevertheless still cool to explore this data and learn about the Tate Museum.
While for some this may be obvious (shout out to my British readers!), a Canadian like myself had never heard of this museum. So before we dive into exploring this data set, let's look at Tate's website and see what it can tell us:
“When Tate first opened its doors to the public in 1897 it had just one site, displaying a small collection of British artworks. Today we have four major sites and the national collection of British art from 1500 to the present day and international modern and contemporary art, which includes nearly 70,000 artworks.“
The readme.md file from the Tidy Tuesday repository tells us a little more about the data.
“Here we present the metadata for around 70,000 artworks that Tate owns or jointly owns with the National Galleries of Scotland as part of ARTIST ROOMS. Metadata for around 3,500 associated artists is also included.”
With this context, we’re ready to start our analysis.
What good is an analysis without having some questions to ask? Let's seek to answer some of the following questions.
While these questions are pretty basic, I think there is a lot to discover from these insights about the Tate museum and art history in general.
Let's get into it!
Let's get started with the standard procedure for loading and getting a general overview of our data set.
# Run once
library(tidyverse)
library(ggthemes) # Gotta make some pretty pictures.
tuesdata <- tidytuesdayR::tt_load('2021-01-12')
## Downloading file 1 of 2: `artists.csv`
## Downloading file 2 of 2: `artwork.csv`
artists <- tuesdata$artists
artwork <- tuesdata$artwork
glimpse(artists)
## Rows: 3,532
## Columns: 9
## $ id           10093, 0, 2756, 1, 622, 2606, 9550, ...
## $ name         "Abakanowicz, Magdalena", "Abbey, Edwin Austin", "Abbott, Berenice", ...
## $ gender       "Female", "Male", "Female", "Male", "Male", ...
## $ dates        "born 1930", "1852–1911", "1898–1991", "1760–1803", ...
## $ yearOfBirth  1930, 1852, 1898, 1760, 1935, 1964, ...
## $ yearOfDeath  NA, 1911, 1991, 1803, NA, 1993, ...
## $ placeOfBirth "Polska", "Philadelphia, United States", "Springfield, United States", ...
## $ placeOfDeath NA, "London, United Kingdom", "Monson, United States", ...
## $ url          "http://www.tate.org.uk/art/artists/magdalena-abakanowicz-10093", ...
glimpse(artwork)
## Rows: 69,201
## Columns: 20
## $ id                 1035, 1036, 1037, 1038, 1039, ...
## $ accession_number   "A00001", "A00002", "A00003", "A00004", ...
## $ artist             "Blake, Robert", "Blake, Robert", "Blake, Robert", ...
## $ artistRole         "artist", "artist", "artist", ...
## $ artistId           38, 38, 38, 38, 39, 39, ...
## $ title              "A Figure Bowing before a Seated Old Man with his Arm Outstretched in Benediction. ...", ...
## $ dateText           "date not known", "date not known", "?c.1785", ...
## $ medium             "Watercolour, ink, chalk and graphite on paper. Verso: graphite on paper", ...
## $ creditLine         "Presented by Mrs John Richmond 1922", ...
## $ year               NA, NA, 1785, NA, 1826, 1826, ...
## $ acquisitionYear    1922, 1922, 1922, 1922, 1919, ...
## $ dimensions         "support: 394 x 419 mm", "support: 311 x 213 mm", ...
## $ width              394, 311, 343, 318, 243, ...
## $ height             419, 213, 467, 394, 335, ...
## $ depth              NA, NA, NA, NA, NA, ...
## $ units              "mm", "mm", "mm", "mm", ...
## $ inscription        NA, NA, NA, NA, NA, ...
## $ thumbnailCopyright NA, NA, NA, NA, NA, ...
## $ thumbnailUrl       "http://www.tate.org.uk/art/images/work/A/A00/A00001_8.jpg", ...
## $ url                "http://www.tate.org.uk/art/artworks/blake-a-figure-bowing-before-a-seated-old-man...", ...
Before jumping into analysis, let's check for missing data in our data sets with the naniar package's vis_miss() function.
library(naniar)
# Need to manually override the large-data warning
vis_miss(artists, warn_large_data = FALSE)
vis_miss(artwork, warn_large_data = FALSE)
There is no missing data in the id and artistId fields of the artwork and artists tables, so a full join on these fields should be fine.
Right?
Wrong.
artDataOJ <- artwork %>%
  full_join(artists, by = c("artistId" = "id"))
glimpse(artDataOJ)
## Rows: 69,395
## Columns: 28
## $ id                 1035, 1036, 1037, 1038, ...
## $ accession_number   "A00001", "A00002", "A00003", ...
## $ artist             "Blake, Robert", "Blake, Robert", ...
## $ artistRole         "artist", "artist", "artist", ...
## $ artistId           38, 38, 38, 38, 39, 39, ...
## $ title              "A Figure Bowing before a Seated Old Man with his Arm Outstretched in Benediction. ...", ...
## $ dateText           "date not known", "date not known", "?c.1785", ...
## $ medium             "Watercolour, ink, chalk and graphite on paper. Verso: graphite on paper", ...
## $ creditLine         "Presented by Mrs John Richmond 1922", ...
## $ year               NA, NA, 1785, NA, 1826, ...
## $ acquisitionYear    1922, 1922, 1922, 1922, 1919, ...
## $ dimensions         "support: 394 x 419 mm", "support: 311 x 213 mm", ...
## $ width              394, 311, 343, 318, 243, ...
## $ height             419, 213, 467, 394, 335, ...
## $ depth              NA, NA, NA, NA, NA, ...
## $ units              "mm", "mm", "mm", "mm", ...
## $ inscription        NA, NA, NA, NA, NA, ...
## $ thumbnailCopyright NA, NA, NA, NA, NA, ...
## $ thumbnailUrl       "http://www.tate.org.uk/art/images/work/A/A00/A00001_8.jpg", ...
## $ url.x              "http://www.tate.org.uk/art/artworks/blake-a-figure-bowing-...", ...
## $ name               "Blake, Robert", "Blake, Robert", ...
## $ gender             "Male", "Male", "Male", ...
## $ dates              "1762–1787", "1762–1787", ...
## $ yearOfBirth        1762, 1762, 1762, 1762, 1757, ...
## $ yearOfDeath        1787, 1787, 1787, 1787, 1827, ...
## $ placeOfBirth       "London, United Kingdom", ...
## $ placeOfDeath       "London, United Kingdom", ...
## $ url.y              "http://www.tate.org.uk/art/artists/robert-blake-38", ...
Seeing our row count increase by 194 rows is a little suspicious. What's going on here?
Let's look at an inner join between these tables.
artDataIJ <- artwork %>%
  inner_join(artists, by = c("artistId" = "id"))
glimpse(artDataIJ)
## Rows: 69,195
## Columns: 28
## (the same 28 columns as artDataOJ above; output truncated)
Well, the inner join gives us 200 fewer rows than the full join (and 6 fewer than the original artwork table). Let's look at the rows not present in the inner join.
# Went back to doing things in a more base R fashion because `filter()` was more confusing in this case.
addedValues <- artDataOJ[!c(artDataOJ$artistId %in% artDataIJ$artistId), ]
glimpse(addedValues)
## Rows: 200
## Columns: 28
## $ id               6652, 16597, 138, 88193, 88194, 88195, NA, NA, NA, ...
## $ accession_number "N04252", "T05029", "T07923", "T12066", "T12067", "T12068", NA, NA, ...
## $ artist           "?British School", "Hullmandel, Charles", "Kabakov, Ilya", "The Leach Pottery (St. Ives, UK)", ...
## $ artistId         19232, 5265, 3462, 12951, 12951, 12951, 3067, 14647, ...
## $ title            "Portrait of a Gentleman, probably of the West Family", ...
## $ year             1545, 1851, 1990, 1955, 1955, 1955, NA, NA, ...
## $ name             NA, NA, NA, NA, NA, NA, "Albers, Anni", "Amzalag, Michael", "Anonymous", ...
## $ gender           NA, NA, NA, NA, NA, NA, "Female", "Male", NA, ...
## $ yearOfBirth      NA, NA, NA, NA, NA, NA, 1899, 1967, NA, ...
## (remaining columns truncated)
By the looks of it, the artwork table was compiled separately from the artists table. Therefore, there is artwork present that does not have an artist linked to it and, similarly, there are artists that are not linked to anything in the artwork table.
While a pretty good job was done with the data, it clearly hasn't been fully audited (or we weren't informed of this issue). Hence the additional 194 rows in the full join.
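As an aside, a tidier way to isolate these mismatches than filtering the full join is dplyr's anti_join(), which returns the rows of one table with no match in the other. A sketch on hypothetical toy tables (not the Tate data, which needs a download):

```r
library(dplyr)

# Toy stand-ins for the artwork and artists tables (values are made up)
artwork_toy <- tibble(id = 1:4,
                      title = c("A", "B", "C", "D"),
                      artistId = c(10, 10, 20, 99))
artists_toy <- tibble(id = c(10, 20, 30),
                      name = c("Blake", "Turner", "Hepworth"))

# Artwork whose artistId has no match in artists
orphan_art <- anti_join(artwork_toy, artists_toy, by = c("artistId" = "id"))
# Artists with no artwork linked to them
idle_artists <- anti_join(artists_toy, artwork_toy, by = c("id" = "artistId"))

orphan_art$title    # artwork "D" has no matching artist
idle_artists$name   # "Hepworth" has no artwork
```

Running both anti-joins on the real tables would separate the unmatched artworks from the unmatched artists directly.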
This result makes sense with the statement on the data cited in the readme.md:
“The dataset in this repository was last updated in October 2014. Tate has no plans to resume updating this repository[…]”
With this in mind, we will use the tables separately until we need a joined table, in which case we will use the inner-joined table.
Tate's site says that they have artwork dating as far back as the 1500s. Let's see what the distribution of artwork age looks like.
ggplot(data = artwork, mapping = aes(x = year, y = ..density..)) +
  theme_hc() +
  geom_histogram(bins = 50, fill = "#A52A2A", color = "#800000") +
  geom_density(size = 0.73, color = "#00007F", alpha = 0.1,
               adjust = 2, position = "stack") +
  labs(x = "Year", y = NULL) +
  ggtitle("Distribution of Artwork Age")
This definitely looks like a bi-modal distribution, with a large amount of artwork from the first half of the 19th century and the second half of the 20th century.
After a little bit of research, here are some of my proposed reasons for this:
To verify these claims, let's examine the credit lines of the artwork. After a quick scroll through with View(artwork), I see some key words that are important to look at.
I'm sure there are other terms, but for now let's use these, as it appears that most of the data uses these terms in the credit line.
Let's get into extracting these terms and classifying them. (Warning: I'm busting out Base R here because it's more intuitive; if you can do this with mutate(), comment below so I can check it out!)
library(stringr)
bought  <- c("Purchased", "Acquired")
donated <- c("Presented", "Accepted", "Bequeathed", "Transferred")
# A simplified field telling us what art has been bought and what has been donated
boughtIndex  <- str_detect(artwork$creditLine, paste(bought, collapse = "|"))
donatedIndex <- str_detect(artwork$creditLine, paste(donated, collapse = "|"))
# I get warnings and it is annoying - but supposedly this is a bug. I'm hiding it in my markdown output.
artwork <- artwork %>% as.data.frame()
artwork$acquisitionMethod[boughtIndex]  <- "Bought"
artwork$acquisitionMethod[donatedIndex] <- "Donated"
With that done, let's see how many NA values remain, compared with the number of missing credit lines.
sum(is.na(artwork$acquisitionMethod))
## [1] 12
sum(is.na(artwork$creditLine))
## [1] 3
Wow! Not bad. With a little bit of exploring by looking at the difference, we will add some patterns to categorize 8 of the remaining credit lines for completeness.
One of them we will not categorize, because the credit line reads as follows.
“Partial purchase and partial gift from the American Fund for the Tate Gallery, courtesy of Julie and Lawrence Salander in memory of John Constable (1776-1837), Daphne Reynolds (1918-2002), Evelyn Joll (1925-2001) and Leslie Parris (1941-2000), 2012. With additional funds provided by Tate Patrons.”
This cannot really be classified as either a purchase or a donation, so for our case we will leave it out.
bought  <- c("Purchased", "Acquired", "Exchanged", "uncovered")
donated <- c("Presented", "Accepted", "Bequeathed", "Transferred",
             "Gift", "Commissioned", "Given")
# A simplified field telling us what art has been bought and what has been donated
boughtIndex  <- str_detect(artwork$creditLine, paste(bought, collapse = "|"))
donatedIndex <- str_detect(artwork$creditLine, paste(donated, collapse = "|"))
artwork$acquisitionMethod[boughtIndex]  <- "Bought"
artwork$acquisitionMethod[donatedIndex] <- "Donated"
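For what it's worth, the same classification can be written in one mutate() call with case_when(); this is my sketch (on a hypothetical toy column), not the post's code. Note the donated branch comes first so that credit lines matching both sets end up "Donated", mirroring the index version, where the donated assignment runs last:

```r
library(dplyr)
library(stringr)

bought_re  <- paste(c("Purchased", "Acquired", "Exchanged", "uncovered"),
                    collapse = "|")
donated_re <- paste(c("Presented", "Accepted", "Bequeathed", "Transferred",
                      "Gift", "Commissioned", "Given"),
                    collapse = "|")

# Toy credit lines (made up) standing in for artwork$creditLine
toy <- tibble(creditLine = c("Purchased 1910",
                             "Presented by Mrs John Richmond 1922",
                             NA))

toy <- toy %>%
  mutate(acquisitionMethod = case_when(
    str_detect(creditLine, donated_re) ~ "Donated",  # donation keywords win
    str_detect(creditLine, bought_re)  ~ "Bought",
    TRUE                               ~ NA_character_
  ))

toy$acquisitionMethod
```

A missing credit line produces an NA condition in str_detect(), which case_when() treats as not-matched, so it falls through to the NA default.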
Now, let's look at the graphs to see if our hypotheses hold. Note that we are now looking at the acquisition year of the artwork.
ggplot(data = artwork %>% filter(!is.na(acquisitionMethod))) +
  theme_hc() +
  geom_histogram(mapping = aes(x = acquisitionYear, y = ..density..),
                 bins = 50, fill = "#A52A2A", color = "#800000") +
  facet_wrap(~acquisitionMethod) +
  labs(x = "Acquisition Year", y = "") +
  ggtitle("Acquisition History of Art by Tate")
Based on these histograms, it seems that for point 1 we are spot on: the initial donations came from Henry Tate's large art collection. For points 2 and 4, a rise in both purchases and donations from the second half of the 20th century to the present coincides nicely with the rise in popularity of contemporary art. For point 3, it appears that the Duveen family's financial support was not focused on buying art (it looks like there was already plenty of art) but rather, more likely, on the upkeep of the gallery and event programming.
If anyone actually knows the history of the Tate museum, I would be very interested in confirmation of these observations, as they look pretty convincing!
The medium field is very diverse in nature. While there is some structure to the descriptions, there are just over 3,400 unique descriptions written across the various pieces.
length(unique(artwork$medium))
## [1] 3402
I thought this challenge would be impossible because of the multi-dimensionality of this field, but I saw some anchors that we can use to split our data.
There is also some noise here that would mess up our data, such as NA values. With this in mind, let's try to clean this field and use it in our data set.
# Need to make this ignore case
materials <- str_split(str_to_lower(artwork$medium),
                       ", |and| on | onto |On | Onto |with |\\. |; ")
materialsdf <- list()
# I'm using a for loop here because I'm working with a list and a vector
for (i in 1:length(materials)) {
  materialsdf[[i]] <- data.frame(materials = str_to_lower(materials[[i]]),
                                 year = artwork$year[i])
}
# Now let's make a data frame
materialsdf <- do.call(rbind, materialsdf)
# Let's clean some names up.
materialsdf$materials <- str_trim(materialsdf$materials)
# Remove leading numbers
materialsdf$materials <- str_remove_all(materialsdf$materials, "^\\d")
# Remove verso
materialsdf$materials <- str_remove(materialsdf$materials, "verso: ")
materialsdf$materials <- str_to_sentence(materialsdf$materials)
glimpse(materialsdf)
## Rows: 152,081
## Columns: 2
## $ materials "Watercolour", "Ink", "Chalk", "Graphite", "Paper", "Graphite", "Paper", ...
## $ year      NA, NA, NA, NA, NA, NA, NA, NA, NA, 1785, 1785, 1785, 1785, NA, NA, 1826, ...
We have some unknown years here. Because this isn't helpful for learning about artistic mediums over time, we will take them out. We also have some empty character values in the materials field, which we will take out as well.
To get a sense of popularity over time, we will also bin our data by century. This isn't a perfect methodology, because the history of trends in art is non-linear and multi-dimensional, but nevertheless there is insight to be found in this manner.
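The binning itself is a one-liner used in the grouping code below: floor the year down to its century and tack on an "s". A quick illustration on a few made-up years:

```r
# Century bin: floor(year / 100) * 100, written with * 0.01 as in the post
year <- c(1545, 1699, 1888, 2003)
century <- paste0(floor(year * 0.01) * 100, "s")
century  # "1500s" "1600s" "1800s" "2000s"
```

So every year from 1500 through 1599 lands in "1500s", and so on.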
The data cleansing on this isn't perfect, but I feel confident that we can list the top 5 mediums of expression from the 16th century to the 21st century based on the artwork from Tate. Because of varying frequencies, I am going to relate to the data in terms of the proportion of all mediums employed in a given century.
# Make this into a tibble because the cool kids do it, and filter the data
groupedMaterials <- as_tibble(materialsdf) %>%
  filter(!is.na(year), !is.na(materials), materials != "") %>%
  count(material = materials,
        century  = paste0(floor(year * 0.01) * 100, "s")) %>%
  arrange(century, -n)
groupedData <- by(groupedMaterials, groupedMaterials["century"], tibble)
centuryProp <- lapply(groupedData, function(x) x$n / sum(x$n))
# I'm using for loops because it's easier to read. I'm reusing the same name to save memory!
for (i in 1:length(groupedData)) {
  groupedData[[i]] <- tibble(groupedData[[i]], CenturyProp = centuryProp[[i]])
}
# Repeating myself, but it works.
groupedData <- do.call(rbind, groupedData)
groupedData <- by(groupedData, groupedData["century"], head, n = 5)
groupedData <- do.call(rbind, groupedData)
glimpse(groupedData)
## Rows: 30
## Columns: 4
## $ material    "Oil paint", "Wood", "Oak", "Canvas", "Gouache", "Oil paint", "Canvas", "Paper", "Oak", "Graphite", ...
## $ century     "1500s", "1500s", "1500s", "1500s", "1500s", "1600s", "1600s", "1600s", "1600s", "1600s", "1700s", ...
## $ n           14, 7, 5, 1, 1, 78, 64, 12, 8, 6, 3070, 2283, 909, 445, 414, 32285, 26215, 3814, 1387, 1325, ...
## $ CenturyProp 0.43750000, 0.21875000, 0.15625000, 0.03125000, 0.03125000, 0.42622951, 0.34972678, ...
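As an aside, the by()/lapply()/do.call() gymnastics above can be collapsed into a single grouped pipeline with mutate() and slice_max() (dplyr 1.0.0 or later). A sketch on hypothetical toy counts, not the author's code:

```r
library(dplyr)

# Toy per-century material counts (values are made up)
toy_counts <- tibble(
  material = c("Oil paint", "Wood", "Oak", "Graphite", "Paper", "Canvas"),
  century  = c("1500s", "1500s", "1500s", "1700s", "1700s", "1700s"),
  n        = c(14, 7, 5, 3070, 2283, 909)
)

top_by_century <- toy_counts %>%
  group_by(century) %>%
  mutate(CenturyProp = n / sum(n)) %>%  # proportion within each century
  slice_max(n, n = 2) %>%               # top 2 here; the post keeps the top 5
  ungroup()

top_by_century$material
```

The grouped mutate() computes each century's proportions, and slice_max() keeps the most frequent mediums per group in one step.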
Now we have our data grouped into the top 5 mediums for each century. Let's visualize it:
ggplot(data = groupedData,
       mapping = aes(x = century, y = CenturyProp, fill = material)) +
  theme_hc() +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(mapping = aes(label = material),
            position = position_dodge(width = 0.9),
            vjust = 0.25, hjust = 0, angle = 90, size = 3) +
  theme(legend.position = "none") +
  labs(x = "Century", y = "") +
  ggtitle("Popular Artistic Mediums: 16th-21st Centuries")
From this visualization we can communicate the following:
Wow! Looking at this data really gives us a feel for the history, seeing the various artistic mediums change over time. These insights also line up with the history of British art over the given periods.
To visualize the gender breakdown of artists over the centuries, we need to do similar transformations of our data that we did for the artistic mediums over time.
We will be using the inner-joined data of the artists and artwork tables for the visualization, as the number of pieces produced by a given artist is in the artwork table and the gender information is in the artists table.
genderTable <- artDataIJ %>%
  count(gender, century = paste0(floor(year * 0.01) * 100, "s")) %>%
  filter(century != "NAs")
# Break down by century
genderTable <- by(genderTable, genderTable["century"], tibble)
# Get proportions
CentProps <- lapply(genderTable, function(x) x$n / sum(x$n))
for (i in 1:length(genderTable)) {
  genderTable[[i]] <- tibble(genderTable[[i]], CenturyProp = CentProps[[i]])
}
# Rebind the data
genderTable <- do.call(rbind, genderTable)
Once our data is transformed we can visualize the gender proportion of the artists in our data set over time.
ggplot(genderTable, mapping = aes(x = century, y = CenturyProp, fill = gender)) +
  theme_hc() +
  geom_bar(stat = "identity") +
  labs(y = "", x = "Century") +
  ggtitle("Artist Gender Proportion: 16th-21st Centuries")
What we can communicate from this visualization is:
Some of the NA values from the 16th-19th centuries may have been female artists, but there is no evidence to support this claim from the data as it is presented.
There is a lot that we can learn about history (and in our case, art) by looking at data. What was really amazing about this project was the ability to uncover the rich history present in this data set.
As “clean” as a data set may be (as stated in the readme.md), one may find themselves needing to do operations like string manipulation and data transformation to uncover the information that is necessary for a given analysis.
Big thank you to the R4DS community for making this data set available to explore!
Thanks for reading! I’ll see you in the next blog!
Be sure to subscribe and never miss an update!
The day you’ve all been waiting for is almost here: rstudio::global(2021) starts tomorrow! This 24-hour virtual event will focus on all things R and RStudio, featuring 50+ speakers from around the world, and it begins at 1600GMT (11 am EST, 8 am PST) on January 21, 2021.
If you want to attend rstudio::global, you don’t need a gold-embossed invitation, but you do still need to register, which you can do at rstudio.com/conference. You’ll be joining more than 12,000 attendees from 136 countries who are already registered and planning to attend.
But once you are registered, you are not done! After you’ve registered, you should plan your conference and enroll for the sessions you want to attend. Every talk will be given twice, so choose the talks that are most convenient for your local time. You’ll have three tracks of talks and more than 50 sessions to choose from, so planning in advance will ensure you don’t miss any of your favorite speakers. Should you need help converting the times shown into your local time, I recommend using the time zone converter at timeanddate.com. And just like at rstudio::conf events in years past, stay tuned for social opportunities that will be announced once rstudio::global begins.
So don’t wait; start compiling your conference schedule now. The 24 hours of rstudio::global(2021) are less than a day away!
Thoughts
A previous post here detailed a simple function (from the uspols package) for extracting the timeline of the Trump presidency from Wikipedia. In this post, then, we turn this timeline into a Twitter bot – one that remembers daily what happened four years ago in the Trump presidency.
A fairly trivial task; what is a bit tricky, however, is (1) dealing with Twitter's 280-character limit, and (2) automating the posting of Twitter threads, as Wikipedia's daily accounting of the last four years can be fairly detailed. We handle both via the rtweet package. For good measure, we demonstrate an unsupervised approach to adding hashtags to our tweet-threads in the form of named entities, via spacy / spacyr.
The Trump timeline can be extracted using the uspols::uspols_wiki_timeline() function. A static version of the table will suffice here.
# devtools::install_github("jaytimm/uspols")
library(tidyverse)

locals1 <- uspols::uspols_wiki_timeline()

## feature-izing Events --
locals1$nsent <- tokenizers::count_sentences(locals1$Events)
locals1$nchar1 <- nchar(locals1$Events)
## --
locals1 <- subset(locals1, nchar1 > 0)
Details of the table content returned from uspols_wiki_timeline(): we consider the 699th day of the Trump presidency. A Thursday.
eg <- locals1 %>% filter(daypres == 699)

eg %>%
  select(quarter:dow) %>%
  slice(1) %>%
  knitr::kable()
quarter | weekof | daypres | date | dow |
---|---|---|---|---|
2018_Q4 | 2018-12-16 | 699 | 2018-12-20 | Thursday |
As detailed below, Wikipedia summarizes the day’s happenings as individual “Events.” Some days were more eventful than others during the past four years. I have added a bullet column for simple enumeration; day 699, then, included four events. Per above, we have added some Event-level features relevant to a tweet-bot: sentence count and character count.
eg %>%
  select(bullet, nsent, nchar1, Events) %>%
  DT::datatable(rownames = FALSE,
                options = list(dom = 't', pageLength = nrow(eg), scrollX = TRUE))
In this example, then, Events #3 and #4 will present problems from a character-limit perspective.
One approach to “tweet-sizing” Wikipedia-based events is simply to extract sentences up until we exceed some character threshold. An easy solution. Below we eliminate stops as markers of abbreviation (eg, honorifics Ms. or Dr.) – which makes sentence tokenizers infinitely more useful.
locals1$Events <- gsub('([A-Z])(\\.)', '\\1', locals1$Events)
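As a quick illustration of what the substitution does (on a made-up string, not from the timeline data), the stop following a capitalized initial is removed, so a sentence tokenizer no longer breaks mid-name:

```r
s <- 'President George H. W. Bush dies at age 94. A second event follows.'
gsub('([A-Z])(\\.)', '\\1', s)
#> "President George H W Bush dies at age 94. A second event follows."
```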
The function below extracts sentences from a larger text until the cumulative character count exceeds some number, as specified by the chars parameter. As we want to add some affixal matter to our tweet threads, we set this at 250.
extract_sent1 <- function(x, chars = 250) {
  z1 <- data.frame(ts = tokenizers::tokenize_sentences(x)[[1]])
  z1$nchar_sent <- nchar(z1$ts) + 1
  z1$cum_char <- cumsum(z1$nchar_sent)
  z2 <- subset(z1, cum_char < chars)
  paste0(z2$ts, collapse = ' ')
}
The table below details the Events of Day 699 in tweet-ready length per our sentence extraction procedure. Per this method, we lose some (perhaps useful) detail from our thread.
Events <- unlist(lapply(eg$Events, extract_sent1, chars = 250))

data.frame(nsent = tokenizers::count_sentences(Events),
           nchar1 = nchar(Events),
           Events = Events) %>%
  DT::datatable(rownames = FALSE,
                options = list(dom = 't', pageLength = nrow(eg), scrollX = TRUE))
If, instead, we wanted to preserve full event content, one approach would be to split text into ~280-character chunks (ideally respecting word boundaries), and piece the thread together via ellipses. The function below is taken directly from this SO post.
cumsum_reset <- function(x, thresh = 4) {
  ans <- numeric()
  i <- 0
  while(length(x) > 0) {
    cs_over <- cumsum(x)
    ntimes <- sum(cs_over <= thresh)
    x <- x[-(1:ntimes)]
    ans <- c(ans, rep(i, ntimes))
    i <- i + 1
  }
  return(ans)
}
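A quick toy illustration (hypothetical input, not from the timeline data) of how the reset works: with unit costs and a threshold of 2, the group index advances every time the running sum would exceed 2.

```r
# each element "costs" 1; groups reset once the cumulative sum exceeds 2
cumsum_reset(c(1, 1, 1, 1, 1), thresh = 2)
#> 0 0 1 1 2   -- three groups: elements 1-2, 3-4, and 5
```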
The second function implements cumsum_reset in the context of counting characters at the word level – once a certain character-count threshold is reached, counting resets.
to_thread <- function(x, chars = 250){
  # no thread counts at present --
  x1 <- data.frame(text = unlist(strsplit(x, ' ')))
  x1$chars <- nchar(x1$text) + 1
  x1$cs1 <- cumsum(x1$chars)
  x1$sub_text <- cumsum_reset(x1$chars, thresh = chars)
  x2 <- aggregate(x1$text, list(x1$sub_text), paste, collapse = " ")
  x2$ww <- 'm'
  x2$ww[1] <- 'f'
  x2$ww[nrow(x2)] <- 'l'
  x2$x <- ifelse(x2$ww %in% c('f', 'm'), paste0(x2$x, ' ...'), x2$x)
  x2$x <- ifelse(x2$ww %in% c('l', 'm'), paste0('... ', x2$x), x2$x)
  paste0(x2$x, collapse = ' || ')
}
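As a quick sanity check on a hypothetical over-long string (not from the timeline), we can confirm that every chunk produced by to_thread() stays tweet-sized once the result is split back on the ' || ' separator:

```r
txt <- paste(rep('word', 120), collapse = ' ')
chunks <- strsplit(to_thread(txt, chars = 250), ' \\|\\| ')[[1]]
nchar(chunks)  # each chunk stays well below 280, even with the '...' affixes
```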
For a demonstration of how these functions work, we use a super long event from the Wikipedia timeline – from 12 August 2018 (day 573) – which is over 1200 characters in length.
eg1 <- locals1 %>% filter(nchar1 == max(nchar1))
Function output is summarized below. Also, we add a thread counter, and check in on character counts.
eg2 <- eg1 %>%
  select(Events) %>%
  mutate(Events = to_thread(Events)) %>% ##### --- !!
  separate_rows(Events, sep = ' \\|\\| ')

eg3 <- eg2 %>%
  mutate(Events = paste0(Events, ' [', row_number(), ' / ', nrow(eg2), ']'),
         nchar1 = nchar(Events)) %>%
  select(nchar1, Events)
rtweet
We can piece together these different workflows as a simple script to be run daily via cron. The daily script (not detailed here) filters the Event timeline to the day’s date four years ago, and then applies our tweet re-sizing functions to address any potential character-count issues. The code below illustrates how threads are composed based on Event features (ie, sentence and character counts).
rowwise() %>%
  mutate(THREAD = case_when(
    nchar1 < 251 ~ Events,
    nchar1 > 250 & nsent > 1 ~ extract_sent1(Events),
    nchar1 > 250 & nsent == 1 ~ to_thread(Events)
  ))
Lastly, we build threads by looping through our list of tweet-readied events, replying to each previously posted tweet via the in_reply_to_status_id parameter of the rtweet::post_tweet() function.
rtweet::post_tweet(tred2$txt3[1], token = tk)

if(nrow(tred1) > 1) {
  for(i in 2:length(tred2$txt3)) {
    Sys.sleep(1)
    last_tweet <- rtweet::get_timeline(user = 'MemoryHistoric')$status_id[1]
    rtweet::post_tweet(tred2$txt3[i],
                       in_reply_to_status_id = last_tweet,
                       token = tk)
  }
} else {NULL}
See and follow MemoryHistoric on Twitter!
Many people in our community actively contribute to rOpenSci projects. Many others would like to contribute but aren’t sure how to go about it. Wanting to get involved in rOpenSci or “give back to open source” means different things to different people. Ideally, a person should be able to find a way to contribute that meets their needs and fits our mission. We created the rOpenSci Community Contributing Guide^{1} to make finding paths to contributing more transparent and sustainable.
We welcome contributors from a continuum of experience: from those who consider themselves newcomers exploring the landscape and trying to see where they fit, all the way to more experienced folks who know exactly what they have to offer. We welcome code and non-code contributions from people at any career stage, and in any sector.
. . . users pursuing their own “selfish” interests build collective value as an automatic byproduct. (Tim O’Reilly on The Architecture of Participation^{2})
Maybe you want to spend 30 minutes sharing your package use case in our public forum or reporting a bug, 1 hour learning by attending a Community Call, 5 hours reviewing an R package submitted for open peer review, or maybe you want to make an ongoing commitment to help maintain a package. Our guide will walk you through the many avenues available.
The guide has three chapters.
rOpenSci & Our Community shares our mission, what it means to be part of our community, and introduces you to some Humans of rOpenSci.
What brings you here? is framed around different motivations. To help you recognize yourself, we categorized what we think contributors want into: Discover; Connect; Learn; Build; Help. Each has a set of “I want to” statements like “I want to discover packages I can use to facilitate my research and access open data”. Clicking on any action under a statement takes you to Chapter 3 for a description of the relevant rOpenSci resource with details on how to contribute.
In our dreams, the rOpenSci Community Contributing Guide can be a model for other open source communities to enable more transparent and sustainable participation.
We’re always standing on the shoulders of giants, some of whom might not think of themselves that way. The motivation, approach, design, tone, and content of this guide were influenced by many who are cited in our bibliography.
Thank you to Julia Stewart Lowndes, Toby Hodges and the rOpenSci team for their input. This work was supported in part by a NumFOCUS Small Development Grant to rOpenSci.
Stefanie Butland & Steffi LaZerte (2020). rOpenSci Community Contributing Guide (Version v0.1.0). Zenodo. http://doi.org/10.5281/zenodo.4000532; source available on GitHub.
Tim O’Reilly (2004). Open Source Paradigm Shift. https://www.oreilly.com/pub/a/tim/articles/paradigmshift_0504.html
While it is well-known that data separation can cause infinite estimates in binary regression models, the same is true for other models with a point mass at the boundary (typically at zero), such as Poisson and Tobit regression. It is shown why this happens and how it can be remedied with bias-reduced estimation, along with implementations in R.
Köll S, Kosmidis I, Kleiber C, Zeileis A (2021). “Bias Reduction as a Remedy to the Consequences of Infinite Estimates in Poisson and Tobit Regression”, arXiv:2101.07141, arXiv.org E-Print Archive. https://arXiv.org/abs/2101.07141
Data separation is a well-studied phenomenon that can cause problems in estimation and inference in binary response models. Complete or quasi-complete separation occurs when there is a combination of regressors in the model whose value can perfectly predict one or both outcomes. In such cases, and such cases only, the maximum likelihood estimates and the corresponding standard errors are infinite. It is less widely known that the same can happen in further microeconometric models. One of the few works in the area is Santos Silva and Tenreyro (2010) who note that the finiteness of the maximum likelihood estimates in Poisson regression depends on the data configuration and propose a strategy to detect and overcome the consequences of data separation. However, their approach can lead to notable bias in the parameter estimates when the regressors are correlated. We illustrate how bias-reducing adjustments to the maximum likelihood score equations can overcome the consequences of separation in Poisson and Tobit regression models.
R package brglm2 from CRAN: https://CRAN.R-project.org/package=brglm2
R package brtobit from R-Forge: https://R-Forge.R-project.org/R/?group_id=2305
The simplest but arguably often-encountered occurrence of data separation in practice is when there is a binary regressor such that the response y = 0 (or another boundary value) whenever the regressor is 1. If P(y = 0) is monotonically increasing in the linear predictor of the model, then the coefficient of the binary regressor will diverge to minus infinity in order to push P(y = 0) in this subgroup as close to 1 as possible.
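This mechanism is easy to reproduce directly (a toy illustration of the same idea, not the paper's simulation): construct a binary regressor that is 1 only when the count response is 0 and watch the ML estimate diverge.

```r
# toy data with quasi-complete separation: y == 0 whenever x == 1
y <- c(0, 0, 0, 1, 2, 3, 1, 4)
x <- c(1, 1, 0, 0, 0, 0, 0, 0)
m <- glm(y ~ x, family = poisson)
coef(m)["x"]               # very large negative estimate (diverging to -Inf)
sqrt(diag(vcov(m)))["x"]   # accompanied by a huge standard error
```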
To illustrate this phenomenon in R for both Poisson and Tobit regression we employ a simple data-generating process: In addition to the intercept we generate a continuous regressor x_{2} uniformly distributed on [-1, 1] and a binary regressor x_{3}. The latter comes from a Bernoulli distribution with probability 0.25 if x_{2} is positive and with probability 0.75 otherwise. Thus, x_{2} and x_{3} are correlated.
The linear predictor employed for both Poisson and Tobit is: 1 + x_{2} – 10 x_{3}, where the extreme coefficient of -10 assures that there is almost certainly data separation. In the full paper linked above we also consider less extreme scenarios where separation may or may not occur. The Poisson response is then drawn from a Poisson distribution using a log link between mean and linear predictor. The Tobit response is drawn from a normal distribution censored at zero with identity link and constant variance of 2. Here, we draw two samples with 100 observations from both models:
dgp <- function(n = 100, coef = c(1, 1, -10, 2), prob = 0.25, dist = "poisson") {
  x2 <- runif(n, -1, 1)
  x3 <- rbinom(n, size = 1, prob = ifelse(x2 > 0, prob, 1 - prob))
  y <- switch(match.arg(tolower(dist), c("poisson", "tobit")),
    "poisson" = rpois(n, exp(coef[1] + coef[2] * x2 + coef[3] * x3)),
    "tobit" = rnorm(n, mean = coef[1] + coef[2] * x2 + coef[3] * x3, sd = sqrt(coef[4]))
  )
  y[y <= 0] <- 0
  data.frame(y, x2, x3)
}

set.seed(2020-10-29)
d1 <- dgp(dist = "poisson")
set.seed(2020-10-29)
d2 <- dgp(dist = "tobit")
Both of these data sets exhibit quasi-complete separation of y with respect to x_{3}, i.e., y is always 0 if x_{3} is 1.
xtabs(~ x3 + factor(y == 0), data = d1) ## factor(y == 0) ## x3 FALSE TRUE ## 0 47 8 ## 1 0 45
We then compare four different modeling approaches in this situation: ML, maximum likelihood estimation of the full model with regressors x_2 and x_3; BR, bias-reduced estimation of the full model; ML/sub, maximum likelihood omitting x_3, fitted only on the subset of observations with x_3 = 0; and ML/SST, maximum likelihood omitting x_3 on the full sample, i.e., the strategy of Santos Silva and Tenreyro.
For Poisson regression, all these models can be fitted with the standard glm() function in R. To obtain the BR estimate, method = "brglmFit" can be plugged in using the brglm2 package (by Ioannis Kosmidis).
install.packages("brglm2")
library("brglm2")

m12_ml <- glm(y ~ x2 + x3, data = d1, family = poisson)
m12_br <- update(m12_ml, method = "brglmFit")
m1_all <- glm(y ~ x2, data = d1, family = poisson)
m1_sub <- update(m1_all, subset = x3 == 0)

m1 <- list("ML" = m12_ml, "BR" = m12_br, "ML/sub" = m1_sub, "ML/SST" = m1_all)
This yields the following results (shown with the wonderful modelsummary package):
library("modelsummary") msummary(m1)
 | ML | BR | ML/sub | ML/SST |
---|---|---|---|---|
(Intercept) | 0.951 | 0.958 | 0.951 | 0.350 |
(0.100) | (0.099) | (0.100) | (0.096) | |
x2 | 1.011 | 1.006 | 1.011 | 1.662 |
(0.158) | (0.157) | (0.158) | (0.144) | |
x3 | -20.907 | -5.174 | ||
(2242.463) | (1.416) | |||
Num.Obs. | 100 | 100 | 55 | 100 |
Log.Lik. | -107.364 | -107.869 | -107.364 | -169.028 |
The following remarks can be made: for the intercept and x_2, the ML, BR, and ML/sub estimates are all close to the true coefficients, while the ML/SST estimates are clearly distorted. For x_3, the ML estimate has diverged towards minus infinity (note the huge standard error), whereas the BR estimate is finite with a reasonable standard error.
Moreover, in more extensive simulation experiments in the paper it is shown that the BR estimates are always finite, and result in Wald-type intervals with better coverage probabilities.
Analogous results can be obtained for Tobit regression with our brtobit package, currently available from R-Forge. This provides both ML and BR estimation for homoscedastic Tobit models. (Some tools are re-used from our crch package that implements various estimation techniques, albeit not BR, for Tobit models with conditional heteroscedasticity.) Below we fit the same four models as in the Poisson case above.
install.packages("brtobit", repos = "http://R-Forge.R-project.org")
library("brtobit")

m22_ml <- brtobit(y ~ x2 + x3, data = d2, type = "ML", fsmaxit = 28)
m22_br <- brtobit(y ~ x2 + x3, data = d2, type = "BR")
m2_all <- brtobit(y ~ x2, data = d2, type = "ML")
m2_sub <- update(m2_all, subset = x3 == 0)

m2 <- list("ML" = m22_ml, "BR" = m22_br, "ML/sub" = m2_sub, "ML/SST" = m2_all)
Because brtobit does not yet provide a direct interface for modelsummary (via broom), we go through the coeftest() results as an intermediate step. These can then be rendered by modelsummary:
library("lmtest")
m2 <- lapply(m2, coeftest)
msummary(m2)
 | ML | BR | ML/sub | ML/SST |
---|---|---|---|---|
(Intercept) | 1.135 | 1.142 | 1.135 | -0.125 |
(0.208) | (0.210) | (0.208) | (0.251) | |
x2 | 0.719 | 0.705 | 0.719 | 2.074 |
(0.364) | (0.359) | (0.364) | (0.404) | |
x3 | -11.238 | -4.218 | ||
(60452.270) | (0.891) | |||
(Variance) | 1.912 | 1.970 | 1.912 | 3.440 |
(0.422) | (0.434) | (0.422) | (0.795) | |
Num.Obs. | 100 | 100 | 55 | 100 |
Log.Lik. | -87.633 | -88.101 | -87.633 | -118.935 |
The results show exactly the same pattern as for the Poisson regression above: ML, BR, and ML/sub yield results close to the true coefficients for the intercept, x_2, and the variance, while the ML/SST estimates are far from the true values. For x_3, only the BR estimate is finite while the ML estimate diverges towards minus infinity. Actually, the estimates would have diverged even more if we hadn’t stopped the Fisher scoring early (via fsmaxit = 28 instead of the default 100).
Overall this clearly indicates that bias-reduced (BR) estimation is a convenient way to avoid infinite estimates and standard errors in these models and to enable standard inference even when data separation occurs. In contrast, the common recommendation to omit the regressor associated with the separation should be avoided, or applied only to the non-separated subset of observations. Otherwise it can give misleading results when regressors are correlated.
You can read the original post in its original format on Rtask website by ThinkR here: {attachment} v0.2.0 : find dependencies in your scripts and fill package DESCRIPTION
A new version of {attachment} has been published on CRAN! We continue to improve its functionality to help you deal with dependencies in R, in particular during package development.
Since our last blog post presenting the aim of {attachment}, two versions reached CRAN. The full documentation is available on the {pkgdown} website: https://thinkr-open.github.io/attachment.
Still, for package developers, att_amend_desc() keeps being your best friend after each change in your package, just before devtools::check(), as we explain in our guide for package development along with documentation.
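A minimal sketch of that development loop (run from the root of your package; function names as documented in {attachment} and {devtools}):

```r
# update Imports/Suggests in DESCRIPTION from the code and docs, then check
attachment::att_amend_desc()
devtools::check()
```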
CRAN version
install.packages("attachment")
Development version
# install.packages("devtools")
devtools::install_github("ThinkR-open/attachment")
Breaking changes
att_to_description() is deprecated in favor of att_amend_desc(), so that it comes first in the auto-completion list, as this is the most used function of this package.
att_from_rmd() gets parameter inline = TRUE by default to explore calls for packages in inline R code.
att_from_rmd() and att_from_rmds() are no longer executed in a separate R session by default. You must set inside_rmd = TRUE to do so.
Minor
find_remotes() helps fill the Remotes field in DESCRIPTION.
att_to_desc_from_is() adds parameter normalize to avoid a problem with {desc} (see https://github.com/r-lib/desc/issues/80).
Find out how to fill DESCRIPTION for your {bookdown} in the documentation: https://thinkr-open.github.io/attachment/articles/fill-pkg-description.html#for-bookdown-1
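For instance (a sketch; the package name passed here is an arbitrary example), find_remotes() inspects how installed packages were obtained and returns candidate entries for the Remotes field:

```r
# returns a remote entry (e.g. github::user/repo) for packages
# that were installed from a non-CRAN source
attachment::find_remotes("attachment")
```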
att_amend_desc() is an alias for att_to_description().
att_to_desc_from_is() amends the DESCRIPTION file from an imports/suggests vector of packages.
att_to_desc_from_pkg() is an alias for att_to_description().
att_to_description() shows packages added/removed from DESCRIPTION.
att_to_description() deals with dependencies in the tests/ directory.
att_from_rmds() allows user-defined regex to detect Rmd files.
This post is better presented on its original ThinkR website here: {attachment} v0.2.0 : find dependencies in your scripts and fill package DESCRIPTION
The estimated vaccine efficacies from the phase III clinical trials provide good information about how strongly a Covid-19 vaccine reduces the probability of a symptomatic Covid-19 infection in the vaccinated individual, e.g. 95% for the Biontech/Pfizer vaccine. However, the probability that a vaccinated person still becomes infected without symptoms and can infect other non-vaccinated persons still seems largely unknown. That is an important number, e.g. for deciding whether it is still legally defensible to impose the same shut-down restrictions on vaccinated persons as on others.
Well, I don’t know how to estimate this number with the publicly available data. Yet being curious, I asked myself how one could in principle estimate this number if one had data from a mobile phone Corona app that tracks past contacts and also stores whether a contact was vaccinated and, if not, whether the contact was likely infected with the Corona virus at the time of the meeting (which may be inferred from later-arriving test results). I did not check the existing literature, which may have much better approaches, but just rearranged some mathematical expressions and ran the resulting estimation approach on simulated data.
Let’s first have a glimpse at the simulated data set:
set.seed(1)
dat = simulate.meeting.data(init.sick.share = 0.3, infect.prob = 0.1,
                            vipe = 0.7, n = 500000, m = 1000000,
                            prob.unobserved = 0.1)
head(dat)
subjectid | infected | m | ms | mv |
---|---|---|---|---|
1 | 0 | 6 | 1 | 3 |
2 | 0 | 3 | 0 | 1 |
3 | 0 | 5 | 1 | 1 |
4 | 0 | 2 | 0 | 1 |
5 | 0 | 3 | 2 | 0 |
6 | 0 | 3 | 0 | 1 |
The dummy variable infected indicates whether during a considered period the subject got infected with Covid-19. Every row corresponds to an unvaccinated subject. The variable m shall be the number of recorded relevant meetings before the infection. The number ms indicates how many of those meetings were with an unvaccinated person that had Covid-19. The variable mv indicates the number of meetings with a person that was vaccinated. For the met vaccinated individuals we don’t observe whether or not they had a Corona infection without symptoms.
The code for the simulate.meeting.data function is given at the end of this post. Here is just a short description of the specified parameters (they are not calibrated to the real world):
init.sick.share = 0.3 means that 30% of the initial population are infected with Corona. For simplicity we assume that the vaccinated population has the same probability to be infected, but they typically have no symptoms. You could also interpret an infected vaccinated person as a person that would be infected if she had not gotten a vaccination.
infect.prob (10%) is the probability that an unvaccinated infected person infects a healthy unvaccinated person given that they meet each other.
vipe (70%) shall stand for the vaccine's infectiousness prevention efficacy. This is the number we want to estimate. I do not know the proper name for it, so I just invented a name and like the abbreviation VIPE. We assume that an infected vaccinated person infects another person in a meeting only with probability infect.prob*(1-vipe).
n and m are the number of subjects and the number of simulated meetings.
prob.unobserved (10%) is the probability that a meeting is not observed in our data set. Unobserved meetings may still lead to infections.
Let us start with some theoretical thoughts that will help us to estimate vipe. Using shorter mathematical notation, let
\[\pi_s = \text{infect.prob}\]
be the probability that an unvaccinated infectious person infects another unvaccinated person given that there is a meeting. Let
\[\pi_v = \text{infect.prob*(1-vipe)*init.sick.share}\]
be the probability that a randomly met vaccinated person, who may or may not have a symptomless corona infection, infects an unvaccinated person given that they meet.
The probability that after the recorded number of meetings $m^s_i$ and $m^v_i$ an initially healthy person $i$ gets infected shall be given by:
\[Pr(\text{i infected})=1-(1-\pi_v)^{m^v_i} \cdot (1-\pi_s)^{m^s_i} \cdot U_i\]
That is 1 minus the probability that no meeting leads to an infection. The term $U_i$ shall stand for the probability of no infection from the meetings that are unobserved in our data set. We also assume that the infection probabilities of all meetings are independent from each other.
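As a quick numeric sanity check with the simulation's parameter values (and ignoring unobserved meetings, i.e. setting $U_i = 1$):

```r
pi_s <- 0.1                      # infect.prob
pi_v <- 0.1 * (1 - 0.7) * 0.3    # infect.prob * (1 - vipe) * init.sick.share
# infection probability for a person with 3 vaccinated and 2 sick contacts
1 - (1 - pi_v)^3 * (1 - pi_s)^2
#> approx 0.212
```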
The logarithm of the probability that an initially healthy unvaccinated person $i$ does not become infected by any meeting therefore is
\[\log Pr(\text{i not infected}) = \log(1-\pi_v) \cdot m^v_i + \log(1-\pi_s) \cdot m^s_i + \log U_i\]
Let us rename the terms on the right hand side as follows:
\[\beta_1 = \log(1-\pi_v)\]
\[\beta_2 = \log(1-\pi_s)\]
where $\beta_1$ and $\beta_2$ are coefficients that we will estimate. We also rewrite
\[\log U_i = \beta_0 + u_i\]where $\beta_0$ is the average of the log-probability to get infected from unobserved meetings and $u_i$ is an unobservable deviation for each individual.
We can then write:
\[\log Pr(\text{i not infected}) = \beta_0 + \beta_1 \cdot m^v_i + \beta_2 \cdot m^s_i + u_i\]
The right hand side looks like that of a nice linear regression equation. Unfortunately, the left hand side contains a probability that we don’t directly observe. One route to proceed could be to assume a distribution for the $u_i$ (or perhaps even just set it equal to 0), derive the log-likelihood function and then compute the maximum likelihood estimator.
Yet, I choose another route: I aggregate the individual-level data and then run a weighted least squares regression.
agg = dat %>%
  group_by(mv, ms) %>%
  summarize(
    n = n(),
    prob.infected = mean(infected),
    y = log(1 - prob.infected)
  ) %>%
  ungroup()
head(agg)
mv | ms | n | prob.infected | y |
---|---|---|---|---|
0 | 0 | 46300 | 0.01 | -0.01 |
0 | 1 | 42400 | 0.11 | -0.11 |
0 | 2 | 15900 | 0.19 | -0.21 |
0 | 3 | 3960 | 0.28 | -0.33 |
0 | 4 | 747 | 0.33 | -0.4 |
0 | 5 | 117 | 0.48 | -0.65 |
The data frame agg contains one row for each combination of meeting numbers mv and ms, and we computed in prob.infected the average share of infected persons in each such cell. We then use the variable y = log(1-prob.infected) as an estimator for the left hand side in our regression specification:
\[y = \beta_0 + \beta_1 \cdot m^v + \beta_2 \cdot m^s + u\]
We now estimate this regression using the aggregated data frame agg with the numbers of subjects n in each cell as weights:
agg = filter(agg, is.finite(y))
reg = lm(y ~ mv + ms, data = agg, weights = agg$n)
beta = coef(reg)
beta
## (Intercept)          mv          ms
## -0.008729151 -0.009242340 -0.104527982
The estimated coefficients themselves are not very informative. We will now convert them to the variables of interest:
names(beta) = NULL
pi.v = 1 - exp(beta[2])
pi.s = 1 - exp(beta[3])
init.sick.share = 0.3
vipe = 1 - ((pi.v / init.sick.share) / pi.s)
c(pi.s = pi.s, pi.v = pi.v, vipe = vipe)
##        pi.s        pi.v        vipe
## 0.099250407 0.009199761 0.691025251
We estimate a 9.9% probability to get infected when meeting an infected unvaccinated person. This is close to our assumed parameter infect.prob of 10%. When meeting a vaccinated person, we only estimate a 0.9% probability to get infected. Note that our data does not distinguish between vaccinated persons with an unobservable infection and non-infected ones. To estimate the vaccine infection prevention efficacy (VIPE), we must therefore adjust for the share of initially infected persons. We get a VIPE estimate of 69.1%, which is not too far off the true value of 70%. Correct standard errors could (hopefully) be obtained via bootstrapping. I repeated the simulation 100 times and find a mean estimate for VIPE of 0.7002 with a standard deviation of 0.02. So, at least with the simulated data, this estimation approach seems to work.
In a real world data set, one would have to think about possible sources of bias, however. For example, behavior and safety procedures may systematically differ depending on whether a person is vaccinated or not. It could also be the case that the number of unobserved meetings is correlated with individual characteristics that are also correlated with the explanatory variables, yielding a classic endogeneity problem in our regression.
Most importantly, I don’t know if there exists a country that collects such data, would allow researchers to analyze it, and at the same time has sufficiently many Covid-19 cases for a sufficiently precise estimation. But even if the presented regression may never be run on real data, I find it helpful to think about possible statistical approaches to estimate such important measures. For example, it may help to better weigh the costs and benefits of stronger data protection vs valuable insights from more data collection.
Overall, I am curious whether we will learn about how infectious vaccinated persons are and which data set and empirical approach corresponding studies will take…
simulate.meeting.data = function(
  init.sick.share = 0.2,  # share of initially sick people
  infect.prob = 0.1,      # infection probability in a meeting
  vipe = 0.7,             # vaccine infection prevention efficacy
  vacc.share = 0.3,       # share of vaccinated people
  prob.unobserved = 0,    # probability that a meeting is unobserved
  n = 100000,             # number of persons
  m = 100000              # number of meetings
) {
  # Draw persons that meet
  m1 = sample.int(n, 2*m, replace = TRUE)
  m2 = sample.int(n, 2*m, replace = TRUE)

  # Only keep first m meetings where different persons meet
  keep = m1 != m2
  m1 = m1[keep][1:m]
  m2 = m2[keep][1:m]

  # Which persons are vaccinated
  vacc = (runif(n) <= vacc.share)

  # Which persons are initially infected; for vaccinated persons that may be
  # the hypothetical infection if they would not be vaccinated
  start.sick = (runif(n) <= init.sick.share)

  # Simulate which meetings would transfer an infection
  infect.dice = runif(m)
  infect1.threshold = infect.prob * ifelse(vacc[m2], 1 - vipe, 1)
  infect2.threshold = infect.prob * ifelse(vacc[m1], 1 - vipe, 1)
  infected1 = (infect.dice <= infect1.threshold) & start.sick[m2]
  infected2 = (infect.dice <= infect2.threshold) & start.sick[m1]

  # Which subjects are sick after all infections
  post.sick1 = m1[infected1]
  post.sick2 = m2[infected2]
  post.sick = start.sick
  post.sick[c(post.sick1, post.sick2)] = TRUE

  # Is the meeting recorded / observed
  record = runif(m) > prob.unobserved

  # In every meeting let s be the sender and r the receiver of a
  # potential infection; use dtplyr for speed
  library(dtplyr)
  dat = bind_rows(
    tibble(s = m1, r = m2, record = record),
    tibble(s = m2, r = m1, record = record)
  ) %>%
    filter(record) %>%
    mutate(
      start.sick.s = start.sick[s],
      start.sick.r = start.sick[r],
      post.sick.s = post.sick[s],
      post.sick.r = post.sick[r],
      infected = !start.sick.r & post.sick.r,
      vacc.s = vacc[s],
      vacc.r = vacc[r]
    ) %>%
    # only keep receivers that are not initially sick
    filter(!start.sick.r) %>%
    # aggregate meeting data on person level
    lazy_dt() %>%
    group_by(r) %>%
    summarize(
      infected = 1L * first(infected),
      m = n(),
      ms = sum(start.sick.s & !vacc.s),
      mv = sum(vacc.s)
    ) %>%
    rename(subjectid = r) %>%
    as_tibble()
  dat
}
How lucrative stocks are in the long run is not only dependent on the length of the investment period but even more on the actual date the investment starts and ends!
Return Triangle Plots are a great way to visualize this phenomenon. If you want to learn more about them and how to create them with R read on!
If you had invested in the Standard & Poor's 500 index at the beginning of 2000, you would have had to wait 14 years until you were in the plus! The reason was, of course, the so-called dot-com bubble, which was at its peak then and crashed soon afterwards. On the other hand, if you had invested in the same index at the beginning of 2003, you would never have had any loss below your initial investment and would have a return of more than 300% by now!
Return triangle plots are a great way to get to grips with this. The following function returns a return triangle for any ticker symbol (Symbol) for any start (from) and end year (to). For retrieving the stock or index data we use the wonderful quantmod package (on CRAN):
library(quantmod)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
##     as.Date, as.Date.numeric
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Version 0.4-0 included new data defaults. See ?getSymbols.

return_triangle <- function(Symbol = "^GSPC", from = 2000, to = 2020) {
  symbol <- getSymbols(Symbol, from = paste0(from, "-01-01"),
                       to = paste0(to, "-12-31"), auto.assign = FALSE)
  symbol_y <- coredata(to.yearly(symbol)[ , c(1, 4)])
  from_to <- seq(from, to)
  M <- matrix(NA, nrow = length(from_to), ncol = length(from_to))
  rownames(M) <- colnames(M) <- from_to
  for (buy in seq_along(from_to)) {
    for (sell in seq(buy, length(from_to))) {
      M[buy, sell] <- (symbol_y[sell, 2] - symbol_y[buy, 1]) / symbol_y[buy, 1]
    }
  }
  round(100 * M, 1)
}

rt <- return_triangle(from = 2009, to = 2020)
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
print(rt, na.print = "") ## 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 ## 2009 23.5 39.3 39.3 57.9 104.7 128.0 126.4 147.9 196.1 177.6 257.8 313.3 ## 2010 12.6 12.6 27.7 65.5 84.4 83.1 100.5 139.5 124.5 189.4 234.2 ## 2011 0.0 13.4 47.0 63.7 62.5 78.0 112.6 99.3 156.9 196.8 ## 2012 13.3 46.8 63.6 62.4 77.8 112.4 99.1 156.6 196.5 ## 2013 29.6 44.4 43.3 57.0 87.5 75.8 126.5 161.7 ## 2014 11.5 10.7 21.3 44.8 35.8 75.0 102.2 ## 2015 -0.7 8.7 29.9 21.8 56.9 81.3 ## 2016 9.8 31.2 23.0 58.5 83.1 ## 2017 18.7 11.3 43.5 65.8 ## 2018 -6.6 20.4 39.1 ## 2019 30.4 50.7 ## 2020 15.0
The rows represent the buy dates (beginning of the respective year), the columns the sell dates (end of the respective year). To create a return triangle plot out of that data we use the fantastic plot.matrix package (on CRAN):
library(plot.matrix)

rt <- return_triangle(from = 2000, to = 2020)
bkp_par <- par(mar = c(5.1, 4.1, 4.1, 4.1)) # adapt margins
plot(rt, digits = 1, text.cell = list(cex = 0.5), breaks = 15,
     col = colorRampPalette(c("red", "white", "green1", "green2",
                              "green3", "green4", "darkgreen")),
     na.print = FALSE, border = NA, key = NULL,
     main = "S&P 500", xlab = "sell", ylab = "buy")
par(bkp_par)
As it stands, some care needs to be taken with setting the breaks and col arguments when creating your own triangle plots. It might help to remove key = NULL so that you can see in the legend whether values below zero are in red and those above in green. If you know some elegant method to set those values automatically, please share it with us in the comments below. I will update the post with an honourable mention of you!
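One candidate approach (a base-R sketch of my own, not from the original post; the helper name symmetric_breaks is made up) is to build a breaks vector that is symmetric around zero, so that zero always falls on the red/green boundary of a diverging palette:

```r
# Sketch: breaks symmetric around zero, so negative returns always map
# to the red half of a diverging palette regardless of the data range.
symmetric_breaks <- function(M, n = 15) {
  r <- max(abs(range(M, na.rm = TRUE)))  # largest absolute value in the matrix
  seq(-r, r, length.out = n + 1)         # n colour cells need n + 1 break points
}

M <- matrix(c(-6.6, 23.5, NA, 313.3), nrow = 2)
br <- symmetric_breaks(M)
length(br)  # 16 break points for 15 colour cells
range(br)   # -313.3 313.3
```

These could then be passed as breaks = symmetric_breaks(rt) together with the colorRampPalette shown above.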
Back to the triangle plot itself: you can clearly see how whole periods form clusters of positive and negative returns… like a heat map with green hills and red valleys.
In the long run, all investments get into the green, but with huge differences of sometimes several hundred percentage points! So even for long-term investors, timing is indeed important but, as every quant knows, unfortunately very, very hard (if not outright impossible).
The post How to Handle Missing Data first appeared on R-bloggers.
This article is part of R-Tips Weekly, a weekly video tutorial series that shows you step-by-step how to do common R coding tasks.
Missing values used to drive me nuts… until I learned how to impute them! In 10 minutes, learn how to visualize and impute missing data in R using ggplot, dplyr, and 3 more packages for simple imputation.
Here are the links to get set up.
We’re going to kick the tires on 3 key packages:
- visdat – For quickly visualizing data
- naniar – For working with NA’s (missing data)
- simputation – For simple imputation (converting missing data to values)
So let’s get started!
It doesn’t get any easier than this. Simply use visdat::vis_miss() to visualize the missing data. We can see Ozone and Solar.R are the offenders.
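As a minimal sketch (assuming the visdat package is installed; the Ozone and Solar.R columns suggest base R's airquality dataset, though the original video tutorial may use different data):

```r
library(visdat)

# Heatmap of missing vs. present cells, column by column, with the
# percentage of missing values reported per column.
vis_miss(airquality)
```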
Use Case: It often makes sense to evaluate the interactions between columns containing missing data. We can use an “upset” plot for this.
Start with a good question:
“Is it often that we have both Ozone and Solar.R missing at the same time?”
We can answer this with gg_miss_upset(). We can see that 2 of the 5 missing Solar.R values (40%) occur in the same observations where Ozone is missing. Might want to check for IoT sensor issues!
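A sketch of that call (assuming the naniar package is installed, again using airquality as a stand-in for the tutorial's data):

```r
library(naniar)

# Upset plot: bar heights count observations sharing each combination
# of missing columns (e.g. Ozone alone vs. Ozone and Solar.R together).
gg_miss_upset(airquality)
```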
Use Case: This is a great before/after visual.
For our final exploratory plot, let’s plot the missing data using geom_miss_point(). It works just like geom_point(), but plots where the missing data are located in addition to the non-missing data.
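A sketch of this plot (assuming ggplot2 and naniar are installed):

```r
library(ggplot2)
library(naniar)

# geom_miss_point() behaves like geom_point(), but additionally draws
# observations with a missing coordinate just below the axes, in a
# second colour.
ggplot(airquality, aes(x = Solar.R, y = Ozone)) +
  geom_miss_point()
```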
The simputation library comes with a host of impute_*() functions. We’ll focus on impute_rf(), which implements a random forest to do the imputation.
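A sketch of the imputation step (assuming simputation is installed; impute_rf() additionally needs the randomForest package, and the choice of predictors here is my own, not necessarily the tutorial's):

```r
library(simputation)

# Impute Ozone and Solar.R from the fully observed weather columns
# using a random forest; rows with missing values receive predictions.
aq_imputed <- impute_rf(airquality, Ozone + Solar.R ~ Wind + Temp)

sum(is.na(aq_imputed$Ozone))  # should drop to 0 if imputation succeeds
```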
This imputes the NA’s, replacing the missing Ozone and Solar.R data. We can see the missing data follows the distribution of the non-missing data in the updated scatter plot.
Reminders:
Time for an air-guitar celebration with your co-worker.
Here’s how to master R programming and become powered by R.
What happens after you learn R for Business.
When your CEO gets word of your Shiny Apps saving the company $$$.
This is career acceleration.
Sign Up to Get the R-Tips Weekly (You’ll get email notifications of NEW R-Tips as they are released): https://mailchi.mp/business-science/r-tips-newsletter
Set Up the GitHub Repo: https://github.com/business-science/free_r_tips
Check out the setup video (https://youtu.be/F7aYV0RPyD0). Or, Hit Pull in the Git Menu to get the R-Tips Code
Once you take these actions, you’ll be set up to receive R-Tips with Code every week. =)
Top R-Tips Tutorials you might like:
Pairwise comparisons are one of the most debated topics in agricultural research: they are very often used and, sometimes, abused in the literature. I have nothing against the appropriate use of this very useful technique and, for those who are interested, some colleagues and I gave a bunch of (hopefully) useful suggestions in a paper a few years ago (follow this link here).
Pairwise comparisons usually follow the application of some sort of linear or generalised linear model; in this setting, the ‘emmeans’ package (Lenth, 2020) is very handy, as it uses a very logical approach. However, we can find ourselves in need of making pairwise comparisons between the elements of a vector, which does not come as the result of linear model fitting.
For example, we may happen to have an old table of means with standard errors and have lost the original raw data. Or, we may happen to have a vector of parameters from a nonlinear regression model, fitted with the nls() function. How do we make pairwise comparisons? Experienced users can profit from the glht() function in the ‘multcomp’ package, although this is not immediate and, at least for me, it always takes some attempts to recall the exact syntax.
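For reference, a sketch of that glht() route (assuming multcomp is installed; the numbers below are made up for illustration, and the estimates are assumed uncorrelated so the covariance matrix is diagonal):

```r
library(multcomp)

est <- c(k1 = 0.043, k2 = 0.026, k3 = 0.030)  # made-up estimates
ses <- c(0.004, 0.002, 0.003)                 # their standard errors
vc  <- diag(ses^2)                            # uncorrelated => diagonal vcov

# parm() wraps a coefficient vector plus its vcov matrix so glht()
# can test linear hypotheses without a fitted model object.
K <- rbind("k1 - k2" = c(1, -1, 0),
           "k1 - k3" = c(1, 0, -1),
           "k2 - k3" = c(0, 1, -1))
summary(glht(parm(est, vc), linfct = K))
```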
Therefore, I have built the pairComp()
wrapper, which is available within the ‘aomisc’ package, the accompanying package for this website. Let’s see how this function works by using a typical example.
This is a real-life example, taken from a research published by Vischetti et al. in 1996 (we have used this example in other posts, before). That research considered three herbicides for weed control in sugar beet, i.e. metamitron (M), phenmedipham (P) and chloridazon (C). Four soil samples were contaminated, respectively with: (i) M alone, (ii) M + P (iii) M + C and (iv) M + P + C. The aim was to assess whether the degradation speed of metamitron in soil depended on the presence of co-applied herbicides. To reach this aim, the soil samples were incubated at 20°C and sub-samples were taken in different times after the beginning of the experiment. The concentration of metamitron in those sub-samples was measured by HPLC analyses, performed in triplicate. The resulting dataset is available within the ‘aomisc’ package.
In the box below, we install the ‘aomisc’ package from GitHub (if necessary), load it and load the ‘metamitron’ dataset.
# library(devtools)
# install_github("OnofriAndreaPG/aomisc")
library(aomisc)
data(metamitron)
head(metamitron)
##   Time Herbicide   Conc
## 1    0         M  92.00
## 2    0         M 118.64
## 3    0         M  89.58
## 4    7         M  59.32
## 5    7         M  62.95
## 6    7         M  62.95
#...
#...
tail(metamitron)
##    Time Herbicide  Conc
## 91   55       MPC 35.75
## 92   55       MPC 37.83
## 93   55       MPC 27.41
## 94   67       MPC 23.38
## 95   67       MPC 28.41
## 96   67       MPC 18.92
The first step we take is to fit a first-order degradation model, as follows:
\[C_{t, h} = A_h \, \exp \left(-k_h \, t \right)\]
where \(C\) is the concentration at time \(t\) for metamitron in the \(h^{th}\) combination (M alone, M + P, M + C and M + P + C), \(A\) is the initial concentration for the metamitron in the \(h^{th}\) combination, \(k\) is the degradation rate for metamitron in the \(h^{th}\) combination. This model is nonlinear and, therefore, we can use the nls()
function for nonlinear least squares regression. The code is given below: please, note that the two parameters are followed by the name of the factor variable in square brackets (i.e.: A[Herbicide] and k[Herbicide]). This is necessary, to fit a different parameter value for each level of the ‘Herbicide’ factor.
#Fit nls grouped model
modNlin <- nls(Conc ~ A[Herbicide] * exp(-k[Herbicide] * Time),
               start = list(A = rep(100, 4), k = rep(0.06, 4)),
               data = metamitron)
summary(modNlin)
##
## Formula: Conc ~ A[Herbicide] * exp(-k[Herbicide] * Time)
##
## Parameters:
##     Estimate Std. Error t value Pr(>|t|)
## A1 9.483e+01  4.796e+00   19.77   <2e-16 ***
## A2 1.021e+02  4.316e+00   23.65   <2e-16 ***
## A3 9.959e+01  4.463e+00   22.31   <2e-16 ***
## A4 1.116e+02  4.184e+00   26.68   <2e-16 ***
## k1 4.260e-02  4.128e-03   10.32   <2e-16 ***
## k2 2.574e-02  2.285e-03   11.26   <2e-16 ***
## k3 3.034e-02  2.733e-03   11.10   <2e-16 ***
## k4 2.186e-02  1.822e-03   12.00   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.701 on 88 degrees of freedom
##
## Number of iterations to convergence: 5
## Achieved convergence tolerance: 7.136e-06
We can retrieve the degradation rates for the four herbicides (k1, k2, k3 and k4) together with standard errors and load them into two vectors, as shown in the box below. In order to make pairwise comparisons, we also need to retrieve an estimate of the residual degrees of freedom, which we can also extract from the model fit object.
tab <- summary(modNlin)
dRates <- tab$coef[5:8, 1]
SEs <- tab$coef[5:8, 2]
dfr <- tab$df[2]
dRates
##         k1         k2         k3         k4
## 0.04260044 0.02573512 0.03033803 0.02185935
SEs
##          k1          k2          k3          k4
## 0.004128447 0.002284696 0.002733498 0.001822218
dfr
## [1] 88
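As a quick aside (my addition, not in the original post): first-order rate constants translate directly into half-lives via \(t_{1/2} = \ln(2)/k\), which is often the easier quantity to report:

```r
# Half-lives (in days) implied by the first-order degradation rates
# reported above: t_half = ln(2) / k.
dRates <- c(k1 = 0.04260044, k2 = 0.02573512,
            k3 = 0.03033803, k4 = 0.02185935)
round(log(2) / dRates, 1)
##   k1   k2   k3   k4
## 16.3 26.9 22.8 31.7
```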
Now we have one vector of estimates to be compared and one vector of standard errors. In this situation, we can make pairwise comparisons by using the pairComp()
function in the ‘aomisc’ package. We just have to pass the vector of model parameters, the vector of standard errors, and, optionally, the names of parameters (we do not need this, as ‘dRates’ is a named vector), the number of residual degrees of freedom (defaults to ‘Inf’) and the multiplicity adjustment method, as in the ‘multcomp’ package (defaults to “single-step”).
cp <- pairComp(dRates, SEs, dfr = dfr, adjust = "holm")
cp$pairs
##
##   Simultaneous Tests for General Linear Hypotheses
##
## Linear Hypotheses:
##              Estimate Std. Error t value Pr(>|t|)
## k1-k2 == 0   0.016865   0.004718   3.574  0.00286 **
## k1-k3 == 0   0.012262   0.004951   2.477  0.04604 *
## k1-k4 == 0   0.020741   0.004513   4.596 8.58e-05 ***
## k2-k3 == 0  -0.004603   0.003563  -1.292  0.37639
## k2-k4 == 0   0.003876   0.002922   1.326  0.37639
## k3-k4 == 0   0.008479   0.003285   2.581  0.04604 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- holm method)
We can also obtain a letter display, by taking the ‘Letters’ slot in the ‘cp’ object. In this case, we might like to change the yardstick protection level, by passing the ‘level’ argument in ‘pairComp()’, that defaults to 0.05.
cp$Letters
##          Mean          SE CLD
## k1 0.04260044 0.004128447   a
## k2 0.02573512 0.002284696  bc
## k3 0.03033803 0.002733498   b
## k4 0.02185935 0.001822218   c
Please note that the pairComp() function can be used flexibly in every situation where we have a vector of estimates and a vector of standard errors. It yields correct results whenever the elements of the vector of estimates are uncorrelated. Hope this is useful. Thanks for reading!
Prof. Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)
email: andrea.onofri@unipg.it
Blog www.statforbiology.com
References
The post 2020 at RStudio: A Year in Review first appeared on R-bloggers.
Photo by Kelly Sikkema on Unsplash
We at RStudio are excited to host our first fully virtual conference this week, rstudio::global. We were so pleased to have so many of you join us this week, and while we wish we could see you all in person again, we were happy to have this opportunity to come together with the open source data science community. We will share a recap of the conference after it concludes this Friday.
Before we dive back into our projects, I thought it would be a good time to look back at what kept us busy in 2020. While the year presented many challenges for everyone, we were pleased to continue to support and deliver value to the R and Python data science community. Below, I list some of the many highlights of the past year. No doubt I have missed a few, but these are some of the things I am particularly proud we were able to accomplish last year.
Our company grew significantly this year, despite the many challenges posed by COVID-19. As part of that growth, we:
On the product side, we significantly enhanced the capabilities of both our commercial and open source products. Specifically, we:
RStudio also expanded its wealth of free and open-source packages available to the larger data science community in 2020. Some of the significant development included:
And of course, the tidyverse team was as productive as always this year, releasing (among other things) upgrades to:
There were also significant updates to:
The team also launched tidymodels.org, a central location for learning and using the tidymodels packages.
Finally, in support of online scientific and technical communication, we introduced the 1.0 version of the distill package, as well as real-time visual editing of R Markdown documents. We also introduced many other updates and enhancements to R Markdown.
2020 was a busy year, and I am sure there are still a dozen things I missed. I know it’s difficult to keep up with everything RStudio is doing, but hopefully the links I’ve included above will help. If you’d like to learn more about any of the professional products, please drop a line to sales@rstudio.com, or book a time to talk with us using this link.
If you’d particularly like to learn more about the many ways that RStudio provides a single home for data science teams using R and Python, we encourage you to register for our upcoming webinar on February 3rd.
Here’s to a happy, healthy, and productive 2021 for all of us!
The post Announcing RStudio 1.4 first appeared on R-bloggers.
RStudio is excited to announce that we released RStudio 1.4 today! The many features of RStudio 1.4 will already be familiar to regular readers of this blog. These include:
You can read the complete set of new features and bug fixes in the RStudio 1.4 release notes.
We’ve written before about how a world class IDE is a requirement for any team that wants to do serious data science. However, whenever a new release of an installed product comes out, it always raises the question, “What value will my team get from upgrading?” For data science teams committed to serious data science, we believe RStudio 1.4 is a must-have upgrade because it:
We believe RStudio 1.4’s support for a single development environment that supports multiple language platforms will help address many data science challenges that teams will face in 2021. We expect that many organizations will find it to be the key to unlocking new value from their data science investments.
Open source and commercial versions of RStudio Desktop and RStudio Server Pro 1.4 are available for download today from the RStudio products page.
If you are just starting out with RStudio and want to learn more about the features available in the RStudio IDE, we invite you to browse the most popular RStudio IDE features.
To receive email notifications for RStudio professional product releases, patches, security information, and general product support updates, subscribe to the Product Information list by visiting the RStudio subscription management portal.
Brief BaseSet history
In this post I will explain the history behind BaseSet, then give a brief introduction to sets, followed by showing what you can do with BaseSet.
I study diseases to try to find what causes them at a research institute associated with a hospital. Thanks to recent technological advances we can analyze many things from a single patient’s sample. Having so much information available can be overwhelming, making it difficult to find the causes of diseases (but it is much better than not having enough information!).
In order to make it easier to focus and aggregate knowledge, scientists have been storing what we know of each gene in several ways. As more information was gathered, efforts to group these genes into those to focus on or discard as causes of different processes of interest started to emerge^{1}. Scientists categorize genes according to where they are, what they do, how they function, and with which other genes they interact. There are also many different sub-groups within these categories.
This has led to several ways of storing this information, and depending on which database and which method you use, you end up with different groups. This in turn means that we aren’t so sure how to group genes, and as new experiments are carried out these groupings might change.
I had previously created a package (BioCor) to compare these gene sets which returns a numerical value of how similar two groups of genes are. But I found out that depending on which database I used I got very different results. So I wanted an easy way to know in which groups a gene was present, but I also wanted this information unified from several data sources to account for this uncertainty.
I started experimenting by extending a class present on GSEABase that I had used previously. I also found out that there were some efforts by other people through the Bioconductor slack channel. So I got in touch with them and we shared some ideas on how to do this.
To explore different ways to solve the problem we ended up developing three different packages that aim to do similar things but take different approaches. At the moment BaseSet is published on CRAN and BiocSet is published on Bioconductor. unisets is available on GitHub although development has stalled due to time constraints. These packages provide methods to work with sets, the mathematical name for groups of elements.
Sets are not only used for tennis games.
In mathematics, if you have a group of multiple elements such as c("apple", "bannana", "pineaple"), it is called a set. A set is a group of elements; usually only unique elements are used (no c("apple", "bannana", "pineaple", "apple")).
To learn and understand the world we group elements together. Usually it is clear what should be inside a group, like the elements above, all of them are fruits. Sometimes it is not so clear if an element belongs to a certain group: Are tomatoes fruits?
An element can also belong to several groups. Basketball is a team sport, but is also played with a ball. So basketball can be in both groups, as we see on Wikipedia:
Categories: Basketball | Sports originating in the United States | Team sports | Summer Olympic sports | Ball games | Games and sports introduced in 1891
Some say that tomatoes should be both considered as fruits and as vegetables, so they could also be in both sets. We can do this for any element we can think of, pencils, mathematical operations, numbers, words, genes…
When we have more than one set we might be interested in which elements are in two groups, which fruits are also vegetables, or which fruits are not vegetables.
These operations are called set operations.
Every time you have used %in% or merge in your code you have been using a set operation. There are many set operations; the most commonly used are “intersection”, for which R has the intersect function, and “union”, which is done with c on vectors. The equivalents on data.frames are done with merge and its arguments (if you are familiar with the *_join verbs from dplyr, the many arguments of merge are one of the reasons why there are so many *_join functions).
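Those base-R building blocks in a quick sketch (fruit names spelled as in the post's own examples):

```r
# Base R already treats vectors as sets for these operations:
fruits     <- c("apple", "bannana", "pineaple", "tomatoes")
vegetables <- c("tomatoes", "carrot")

intersect(fruits, vegetables)  # "tomatoes"
union(fruits, vegetables)      # all five items, duplicates removed
setdiff(fruits, vegetables)    # fruits that are not vegetables
"tomatoes" %in% fruits         # TRUE
```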
As you see R already provides some functions to work with sets. So why write a new package for sets? Let’s see it with some code:
First load the package:
library("BaseSet")
packageVersion("BaseSet")
[1] '0.0.14'
In order to work with sets the package defines its own class named TidySet. We can transform our data with the tidySet function:
sets <- list(fruits = c("apple", "bannana", "pineaple"))
TS <- tidySet(sets)
TS
  elements   sets fuzzy
1    apple fruits     1
2  bannana fruits     1
3 pineaple fruits     1
We can explore some properties of this set:
nElements(TS)
[1] 3
nSets(TS)
[1] 1
then add more fruits
TS1 <- add_elements(TS, "orange")
nElements(TS1)
[1] 4
TS1
  elements   sets fuzzy
1    apple fruits     1
2  bannana fruits     1
3 pineaple fruits     1
We can see it has one more element but it isn’t displayed; that’s because we didn’t record which set this element belongs to:
relation <- data.frame(sets = "fruits", elements = "orange")
TS1 <- add_relation(TS, relation)
TS1
  elements   sets fuzzy
1    apple fruits     1
2  bannana fruits     1
3 pineaple fruits     1
4   orange fruits     1
As you might have noticed, there is a third column named fuzzy that so far is always 1. This column contains a value between 0 and 1 and represents the membership of an element to that set. A value of 1 indicates full membership and 0 that there isn’t any relationship. Values in between encode the uncertainty. We will demonstrate this later, but first let’s create a new set to do some operations with it:
relation <- data.frame(sets = c("vegetables", "fruits"),
                       elements = c("tomatoes", "tomatoes"))
TS2 <- add_relation(TS1, relation)
TS2
  elements       sets fuzzy
1    apple     fruits     1
2  bannana     fruits     1
3 pineaple     fruits     1
4   orange     fruits     1
5 tomatoes vegetables     1
6 tomatoes     fruits     1
Now that we have two sets we can perform more operations:
intersection(TS2, c("vegetables", "fruits"))
  elements              sets fuzzy
1 tomatoes vegetables∩fruits     1
union(TS2, c("vegetables", "fruits"))
  elements              sets fuzzy
1    apple vegetables∪fruits     1
2  bannana vegetables∪fruits     1
3 pineaple vegetables∪fruits     1
4   orange vegetables∪fruits     1
5 tomatoes vegetables∪fruits     1
Note that the names of the resulting sets have been automatically set up using the appropriate symbols.
So far we have set up how to count and merge different sets but one of the goals of BaseSet is to also encode the uncertainty when we assign an element to a set.
To account for uncertainty mathematicians have come up with the idea of fuzzy sets. Fuzzy sets are sets but instead of knowing for sure that an element belongs to the set, there is a value, the fuzzy value or membership, that measures the uncertainty of an element belonging to a set.
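In BaseSet terms that is just a value in the fuzzy column; a small sketch (assuming the package is installed, and using tidySet() on a relation data.frame that carries its own fuzzy column, with the 0.5 below being an invented membership value):

```r
library(BaseSet)

# A relation data.frame may include a 'fuzzy' column encoding partial
# membership; the 0.5 for tomatoes is made up for illustration.
rel <- data.frame(elements = c("apple", "tomatoes"),
                  sets     = c("fruits", "fruits"),
                  fuzzy    = c(1, 0.5))
tidySet(rel)
```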
A typical example involves ratings. Imagine classification methods that, given a picture, classify the probability of each object being in the picture. It might return probabilities such as:
classification <- matrix(
  c(0.1, 0.2, 0.60, 0.85,
    0.6, 0.5, 0.25, 0.16),
  nrow = 2, ncol = 4, byrow = TRUE,
  dimnames = list(c("image1", "image2"),
                  c("Cat", "Dog", "Duck", "Fish")))
classification # Algorithm classification
       Cat Dog Duck Fish
image1 0.1 0.2 0.60 0.85
image2 0.6 0.5 0.25 0.16
TS <- tidySet(classification) # Transform data
TS
  elements sets fuzzy
1   image1  Cat  0.10
2   image2  Cat  0.60
3   image1  Dog  0.20
4   image2  Dog  0.50
5   image1 Duck  0.60
6   image2 Duck  0.25
7   image1 Fish  0.85
8   image2 Fish  0.16
We can calculate how probable it is that the image is classified to just one object.
element_size(TS)
   elements size probability
1    image1    0      0.0432
2    image1    1      0.3252
3    image1    2      0.4802
4    image1    3      0.1412
5    image1    4      0.0102
6    image2    0      0.1260
7    image2    1      0.3810
8    image2    2      0.3620
9    image2    3      0.1190
10   image2    4      0.0120
Given these memberships, there is just a 0.3252 probability that the machine would classify the first image as just one thing, so it is more probable that the first image will be misclassified^{2}.
For the second image it is more probable to contain just one object than two, but there is still high uncertainty, as the probability of it being classified to none of the 4 objects is 0.1260^{3}.
This was not evident from the “output” of the classification but is equally important to know in many situations.
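Those two quoted probabilities can be reproduced in plain R from the membership values (my sketch, treating each label as an independent Bernoulli event):

```r
p1 <- c(Cat = 0.10, Dog = 0.20, Duck = 0.60, Fish = 0.85)  # image1
p2 <- c(Cat = 0.60, Dog = 0.50, Duck = 0.25, Fish = 0.16)  # image2

# P(exactly one label applies): sum over i of p_i * prod_{j != i}(1 - p_j)
p_exactly_one <- function(p)
  sum(sapply(seq_along(p), function(i) p[i] * prod(1 - p[-i])))

p_exactly_one(p1)  # 0.3252, as reported by element_size()
prod(1 - p2)       # 0.126: image2 matches none of the four labels
```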
I submitted this package to rOpenSci peer-review because I wanted more feedback on these fuzzy sets methods and wasn’t sure of the mix I made of S4 methods and S3 methods.
The topic is not a common one, so although I submitted the package in January, the review started in June (after a 2-month hiatus due to COVID). This long waiting time occurred because it was hard for the editor (thanks Anna Krystalli!) to find two reviewers with all that was going on.
Once two reviewers were found, the feedback started coming (many thanks Zebulun Arendsee and Jennifer Chang!). The reviewers pointed out inconsistencies in the order of the output, suggested more methods, and pointed out mismatches between the documentation and the code.
Most importantly, much of the comments and exchange revolved around some methods and explanations for fuzzy sets which were not clear^{4}.
I made my best efforts to improve the package and hope you’ll find it useful in your projects. Try it out!
See this article reviewing the different methods used.
We add up the probability of being one feature (and not the others, i.e. (1 - x)) for all the elements: 0.1*(1-0.2)*(1-0.6)*(1-0.85) + (1-0.1)*0.2*(1-0.6)*(1-0.85) + (1-0.1)*(1-0.2)*0.6*(1-0.85) + (1-0.1)*(1-0.2)*(1-0.6)*0.85
Similar to the other calculation, this probability is (1-0.6)*(1-0.5)*(1-0.25)*(1-0.16).
If there is still something unclear, let me know! Open an issue.
The post Finding answers faster for COVID-19: an application of Bayesian predictive probabilities first appeared on R-bloggers.
As we evaluate therapies for COVID-19 to help improve outcomes during the pandemic, researchers need to be able to make recommendations as quickly as possible. There really is no time to lose. The Data & Safety Monitoring Board (DSMB) of COMPILE, a prospective individual patient data meta-analysis, recognizes this. They are regularly monitoring the data to determine as quickly as possible if there is a sufficiently strong signal to indicate effectiveness of convalescent plasma (CP) for hospitalized patients not on ventilation.
How much data is enough to draw a conclusion? We know that at some point in the next few months, many if not all of the studies included in the meta-analysis will reach their target enrollment, and will stop recruiting new patients; at that point, the meta-analysis data set will be complete. Before that end-point, an interim DSMB analysis might indicate there is a high probability that CP is effective although it does not meet the pre-established threshold of 95%. If we know the specific number of patients that will ultimately be included in the final data set, we can predict the probability that the findings will put us over that threshold, and possibly enable a recommendation. If this probability is not too low, the DSMB may decide it is worth waiting for the complete results before drawing any conclusions.
Predicting the probability of success (or futility) is done using the most recent information collected from the study, which includes observed data, the parameter estimates, and the uncertainty surrounding these estimates (which is reflected in the posterior probability distribution).
This post provides an example using a simulated data set to show how this prediction can be made.
In this example, the outcome is the WHO 11-point ordinal scale for clinical status at 14 days, which ranges from 0 (uninfected and out of the hospital) to 10 (dead), with various stages of severity in between. As in COMPILE, I’ll use a Bayesian proportional odds model to assess the effectiveness of CP. The measure of effectiveness is an odds ratio (OR) that compares the cumulative odds of having a worse outcome for the treated group compared to the cumulative odds for the control group:
\[ \text{Cumulative odds for level } k \text{ in treatment arm } j =\frac{P(Y_{ij} \ge k)}{P(Y_{ij} \lt k)}, \ k \in \{1,\dots, 10\} \]
The goal is to reduce the odds of having a bad outcome, so a successful therapy is one where \(OR \lt 1\). In a Bayesian context, we estimate the posterior probability distribution of the \(OR\) (based on prior assumptions before we have collected any data). We will recommend the therapy in the case that most of the probability density lies to the left of 1; in particular we will claim success only when \(P(OR \lt 1) > 0.95\). For example, in the figure the posterior distribution on top would lead us to consider the therapy successful since 95% of the density falls below 1, whereas the distribution on the bottom would not:
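To make the estimand concrete, here is a small base-R sketch (mine, not from the post) computing empirical cumulative odds ratios at each cut-point of an ordinal outcome:

```r
# Empirical cumulative OR at cut-point k: odds(Y >= k | treated) divided
# by odds(Y >= k | control). Values below 1 favour the treatment under
# this "worse outcome" coding.
cum_or <- function(y_trt, y_ctl, K) {
  sapply(2:K, function(k) {
    (mean(y_trt >= k) / mean(y_trt < k)) /
    (mean(y_ctl >= k) / mean(y_ctl < k))
  })
}

set.seed(1)
y_ctl <- sample(1:11, 500, replace = TRUE)
y_trt <- pmax(1, y_ctl - rbinom(500, 1, 0.4))  # shifted toward better scores
round(cum_or(y_trt, y_ctl, 11), 2)             # all at or below 1 by construction
```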
This data set here is considerably simpler than the COMPILE data that has motivated all of this. Rather than structuring this example as a multi-study data set, I am assuming a rather simple two-arm design without any sort of clustering. I am including two binary covariates related to sex and age. The treatment in this case reduces the odds of worse outcomes (or increases the odds of better outcomes). For more detailed discussion of generating ordinal outcomes, see this earlier post (but note that I have flipped direction of cumulative probability in the odds formula).
library(simstudy)
library(data.table)

def1 <- defDataAdd(varname = "male", formula = "0.7", dist = "binary")
def1 <- defDataAdd(def1, varname = "over69", formula = "0.6", dist = "binary")
def1 <- defDataAdd(def1, varname = "z",
                   formula = "0.2*male + 0.3*over69 - 0.3*rx", dist = "nonrandom")

baseprobs <- c(0.10, 0.15, 0.08, 0.07, 0.08, 0.08, 0.11, 0.10, 0.09, 0.08, 0.06)

RNGkind("L'Ecuyer-CMRG")
set.seed(9121173)

dd <- genData(450)
dd <- trtAssign(dd, nTrt = 2, grpName = "rx")
dd <- addColumns(def1, dd)
dd <- genOrdCat(dd, adjVar = "z", baseprobs = baseprobs, catVar = "y")
Here is a plot of the cumulative proportions by treatment arm for the first 450 patients in the (simulated) trial. The treatment arm has more patients with lower WHO-11 scores, so for the most part lies above the control arm line. (This may be a little counter-intuitive, so it may be worthwhile to think about it for a moment.)
With the data from 450 patients in hand, the first step is to estimate a Bayesian proportional odds model, which I am doing in Stan. I use the package cmdstanr to interface between R and Stan.
Here is the model:
\[ \text{logit}\left(P(y_{i}) \ge k \right) = \tau_k + \beta_1 I(\text{male}) + \beta_2 I(\text{over69}) + \delta T_i, \ \ \ k \in \{ 1,\dots,10 \} \]
where \(T_i\) is the treatment indicator for patient \(i\), and \(T_i = 1\) when patient \(i\) receives CP. \(\delta\) represents the log odds ratio, so \(OR = e^{\delta}\). I’ve included the Stan code for the model in the the first addendum.
library(cmdstanr)

dt_to_list <- function(dx) {
  N <- nrow(dx)                 ## number of observations
  L <- dx[, length(unique(y))]  ## number of levels of outcome
  y <- as.numeric(dx$y)         ## individual outcome
  rx <- dx$rx                   ## treatment arm for individual
  x <- model.matrix(y ~ factor(male) + factor(over69), data = dx)[, -1]
  D <- ncol(x)
  list(N = N, L = L, y = y, rx = rx, x = x, D = D)
}

mod <- cmdstan_model("pprob.stan")

fit <- mod$sample(
  data = dt_to_list(dd),
  seed = 271263,
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 2000,
  iter_sampling = 2500,
  step_size = 0.1
)
## Running MCMC with 4 parallel chains...
##
## Chain 4 finished in 52.3 seconds.
## Chain 1 finished in 52.9 seconds.
## Chain 3 finished in 53.4 seconds.
## Chain 2 finished in 55.6 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 53.6 seconds.
## Total execution time: 55.8 seconds.
Once the model is fit, our primary interest is whether we can make a recommendation about the therapy. A quick check to verify if \(P(OR < 1) > 0.95\) confirms that we are not there yet.
library(posterior)

draws_df <- as_draws_df(fit$draws())
draws_dt <- data.table(draws_df[-grep("^yhat", colnames(draws_df))])
mean(draws_dt[, OR < 1])
## [1] 0.89
A plot that shows the bottom 95% portion of the density in blue makes it clear that the threshold has not been met:
We have collected complete data from 450 patients out of an expected 500, though we are not yet ready to declare success. An interesting question to ask at this point is “given what we have observed up until this point, what is the probability that we will declare success after 50 additional patients are included in the analysis?” If the probability is sufficiently high, we may decide to delay releasing the inconclusive results pending the updated data set. (On the other hand, if the probability is quite low, there may be no point in delaying.)
The prediction incorporates three potential sources of uncertainty. First, there is the uncertainty regarding the parameters, which is described by the posterior distribution. Second, even if we knew the parameters with certainty, the outcome remains stochastic (i.e. not pre-determined conditional on the parameters). Finally, we don’t necessarily know the characteristics of the remaining patients (though we may have some or all of that information if recruitment has been finished but complete follow-up has not).
In the algorithm that follows, the steps map directly onto these three elements of uncertainty:
We repeat this cycle, say 1000 times. The proportion of cycles that are counted as a success represents the predictive probability of success.
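Before stepping through the real implementation, the cycle is worth seeing in miniature. The sketch below is my own toy analogue using a conjugate beta-binomial model with a binary endpoint, not the trial's actual proportional odds model; it exists only to make the draw-simulate-refit-check loop concrete:

```r
# Toy predictive probability of success (beta-binomial analogue).
# Success criterion here: P(p < 0.5 | data) > 0.95 under a Beta(1,1) prior.
set.seed(271263)

y_obs <- rbinom(450, 1, 0.47)        # simulated observed responses so far
a <- 1 + sum(y_obs)                  # posterior Beta(a, b) parameters
b <- 1 + length(y_obs) - sum(y_obs)

success <- function(y) {
  pbeta(0.5, 1 + sum(y), 1 + length(y) - sum(y)) > 0.95
}

pp <- mean(replicate(1000, {
  p_draw <- rbeta(1, a, b)           # 1. parameter uncertainty (posterior draw)
  y_new  <- rbinom(50, 1, p_draw)    # 2. outcome uncertainty (50 new patients)
  success(c(y_obs, y_new))           # 3. re-evaluate criterion with full data
}))

pp  # predictive probability of success
```

The third source of uncertainty, the covariates of the new patients, has no analogue in this toy since the model has none; in the real implementation it is handled by resampling patient profiles from the observed data.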
library(glue)

dd_new <- dd[, .(id = sample(id, 25, replace = TRUE)), keyby = rx]
dd_new <- merge(dd[, .(id, male, over69)], dd_new, by = "id")
dd_new[, id := (nrow(dd) + 1):(nrow(dd) + .N)]
draw <- as.data.frame(draws_dt[sample(.N, 1)])
The coefficients \(\hat{\beta}_1\) (male), \(\hat{\beta}_2\) (over69), and \(\hat{\delta}\) (treatment effect) are extracted from the draw from the posterior:
D <- dt_to_list(dd)$D

beta <- as.vector(x = draw[, glue("beta[{1:D}]")], mode = "numeric")
delta <- draw$delta
coefs <- as.matrix(c(beta, delta))

coefs
##       [,1]
## [1,]  0.22
## [2,]  0.60
## [3,] -0.19
Using the draws of the \(\tau_k\)’s, I’ve calculated the corresponding probabilities that can be used to generate the ordinal outcome for the new observations:
tau <- as.vector(draw[grep("^tau", colnames(draw))], mode = "numeric")
tau <- c(tau, Inf)

cprop <- plogis(tau)
xprop <- diff(cprop)
baseline <- c(cprop[1], xprop)

baseline
## [1] 0.117 0.136 0.123 0.076 0.089 0.101 0.114 0.102 0.054 0.040 0.048
zmat <- model.matrix(~male + over69 + rx, data = dd_new)[, -1]
dd_new$z <- zmat %*% coefs

setkey(dd_new, id)
dd_new <- genOrdCat(dd_new, adjVar = "z", baseline, catVar = "y")
dx <- rbind(dd, dd_new)
fit_pp <- mod$sample(
  data = dt_to_list(dx),
  seed = 737163,
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 2000,
  iter_sampling = 2500,
  step_size = 0.1
)

## Running MCMC with 4 parallel chains...
##
## Chain 2 finished in 66.5 seconds.
## Chain 4 finished in 67.3 seconds.
## Chain 3 finished in 67.5 seconds.
## Chain 1 finished in 68.1 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 67.4 seconds.
## Total execution time: 68.2 seconds.
draws_pp <- data.table(as_draws_df(fit_pp$draws()))

draws_pp[, mean(OR < 1)]
## [1] 0.79
The next step is to pull all these elements together in a single function that we can call repeatedly to estimate the predictive probability of success. This probability is estimated by calculating the proportion of iterations that result in a success.
The computing resources required for this estimation can be quite substantial. If we iterate 1000 times, we need to fit 1000 Bayesian models, which could be prohibitive; a high performance computing cluster (HPC) may be necessary. (I touched on this earlier when I described exploring the characteristics of Bayesian models.) I have provided the code in the second addendum in case any readers are interested in trying to implement it on an HPC.
I’ll conclude with a figure that shows how predictive probabilities can vary depending on the observed sample size and \(P(OR < 1)\) for the interim data set. Based on the data generating process I’ve used here, if we observe \(P(OR < 1) = 90\%\) at an interim look after 250 patients, it is considerably more probable that we will end up over 95% than if we observe that same probability at an interim look after 450 patients. This makes sense, of course, since the estimate at 450 patients will have less uncertainty, and adding 50 patients will not likely change the results dramatically. The converse is true after 250 patients.
Ultimately, the interpretation of the predictive probability will depend on the urgency of making a recommendation, the costs of waiting, the costs of deciding too soon, and other factors specific to the trial and those making the decisions.
data {
  int N;               // number of observations
  int L;               // number of WHO categories
  int y[N];            // vector of categorical outcomes
  int rx[N];           // treatment or control
  int D;               // number of covariates
  row_vector[D] x[N];  // N x D matrix of covariates
}

parameters {
  vector[D] beta;      // covariate estimates
  real delta;          // overall treatment effect
  ordered[L-1] tau;    // cut-points for cumulative odds model ([L-1] vector)
}

transformed parameters {
  vector[N] yhat;
  for (i in 1:N) {
    yhat[i] = x[i] * beta + rx[i] * delta;
  }
}

model {
  // priors
  beta ~ student_t(3, 0, 10);
  delta ~ student_t(3, 0, 2);
  tau ~ student_t(3, 0, 8);

  // outcome model
  for (i in 1:N)
    y[i] ~ ordered_logistic(yhat[i], tau);
}

generated quantities {
  real OR = exp(delta);
}
library(slurmR)

est_from_draw <- function(n_draw, Draws, dd_obs, D, s_model) {

  set_cmdstan_path(path = "/.../cmdstan/2.25.0")

  dd_new <- dd_obs[, .(id = sample(id, 125, replace = TRUE)), keyby = rx]
  dd_new <- merge(dd_obs[, .(id, male, over69)], dd_new, by = "id")
  dd_new[, id := (nrow(dd_obs) + 1):(nrow(dd_obs) + .N)]

  draw <- as.data.frame(Draws[sample(.N, 1)])

  beta <- as.vector(x = draw[, glue("beta[{1:D}]")], mode = "numeric")
  delta <- draw$delta
  coefs <- as.matrix(c(beta, delta))

  tau <- as.vector(draw[grep("^tau", colnames(draw))], mode = "numeric")
  tau <- c(tau, Inf)

  cprop <- plogis(tau)
  xprop <- diff(cprop)
  baseline <- c(cprop[1], xprop)

  zmat <- model.matrix(~male + over69 + rx, data = dd_new)[, -1]
  dd_new$z <- zmat %*% coefs

  setkey(dd_new, id)
  dd_new <- genOrdCat(dd_new, adjVar = "z", baseline, catVar = "y")

  dx <- rbind(dd_obs, dd_new)

  fit_pp <- s_model$sample(
    data = dt_to_list(dx),
    refresh = 0,
    chains = 4L,
    parallel_chains = 4L,
    iter_warmup = 2000,
    iter_sampling = 2500,
    step_size = 0.1
  )

  draws_pp <- data.table(as_draws_df(fit_pp$draws()))
  return(data.table(n_draw, prop_success = draws_pp[, mean(OR < 1)]))
}

job <- Slurm_lapply(
  X = 1L:1080L,
  FUN = est_from_draw,
  Draws = draws_dt,
  dd_obs = dd,
  D = D,
  s_model = mod,
  njobs = 90L,
  mc.cores = 4L,
  job_name = "i_pp",
  tmp_path = "/.../scratch",
  plan = "wait",
  sbatch_opt = list(time = "03:00:00", partition = "cpu_short"),
  export = c("dt_to_list"),
  overwrite = TRUE
)

job
res <- Slurm_collect(job)

rbindlist(res)[, mean(prop_success >= 0.95)]
A short code-golf challenge led me to learn about the Kempner series, which is the series made of the reciprocals of the positive integers, excluding all those containing the digit 9. Most surprisingly, this exclusion is enough for the series to converge (to a value close to 23). The explanation for this convergence is that, citing Wikipedia,
“The number of n-digit positive integers that have no digit equal to ‘9’ is 8 × 9^{n−1}“
and since the inverses of these n-digit positive integers are less than 10^{1−n}, the series is bounded by 80. In simpler terms, it converges because the fraction of remaining terms in the series decreases geometrically as (9/10)^{n−1}. Unsurprisingly (?) the series is also atrociously slow to converge (for instance the first million terms sum to about 11) and there exist recurrence representations that speed up its computation. Here is the code-golf version
n=scan()+2 while(n<-n-1){ F=F+1/T while(grepl(9,T<-T+1))0} F
that led me to learn about the R function grepl. (The explanation for the pun in the title is that Semper Fidelis is the motto of the corsair City of Saint-Malo or Sant-Maloù, Brittany.)
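For readers who prefer an ungolfed version, here is my own plain R expansion of the partial sums, which confirms just how slowly the series crawls toward its limit:

```r
# Partial sum of the Kempner series: sum of 1/n over n <= N,
# excluding every integer that contains the digit 9
kempner_partial <- function(N) {
  n <- 1:N
  sum(1 / n[!grepl("9", n)])  # grepl drops every integer containing a 9
}

kempner_partial(1e5)  # still far below the limit of roughly 22.92
```

Even with a hundred thousand terms the partial sum is under 10, consistent with the first million terms summing to only about 11.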
The post Simple Easy Beginners Web Scraping in R with {ralger} first appeared on R-bloggers.
Web scraping, by nature, requires a lot of understanding, from finding the right CSS selector to correctly parsing the scraped content. While there are a lot of R packages (and Python packages, for that matter) for this, {ralger}
does a wonderful job of abstracting the complicated things and providing a simple easy-to-use Beginner-friendly Web Scraping Package. {ralger}
has simple functions to quickly scrape / extract Title Text (H1, H2, H3), Tables, URLs, Images from the given web page.
Below is an example on how to scrape IMDB Website (for educational purposes) in R with {ralger}
#install.packages("ralger")
library(ralger)

link <- "https://www.imdb.com/chart/top"
node <- "#main > div > span > div > div > div.lister > table > tbody > tr:nth-child(n) > td.titleColumn > a"

extract <- scrap(link, node)
img_links <- images_preview(link)
imdb250 <- table_scrap(link)

link <- "https://www.imdb.com/search/title/?groups=top_250&sort=user_rating"

my_nodes <- c(
  ".lister-item-header a",        # The title
  ".text-muted.unbold",           # The year of release
  ".ratings-imdb-rating strong"   # The rating
)

names <- c("title", "year", "rating") # respect the nodes order

df_rank <- tidy_scrap(link = link, nodes = my_nodes, colnames = names)
596 episodes. 40 seasons. 1 package!
I’m a pretty big fan of Survivor and have religiously watched every season since the first. With 40 seasons under its belt, there’s a tonne of data to dive into. However, getting that data in one place has been tedious. Hence, the survivoR package.
survivoR is a collection of datasets detailing events across all 40 seasons of the US Survivor, including castaway information, vote history, immunity and reward challenge winners, jury votes, and viewers.
Currently, the package exists on Github and can be installed with the following code.
devtools::install_github("doehm/survivoR")
CRAN: TBA
Below are all the datasets that are contained within the package.
A data frame containing summary details of each season of Survivor, including the winner, runners-up and location. This is a nested data frame, given there may be 1 or 2 runners-up. By using a nested data frame the grain is maintained at 1 row per season.
season_summary
#> # A tibble: 40 x 17
#>    season_name season location country tribe_setup full_name winner  runner_ups
#>  1 Survivor: ~      1 Pulau T~ Malays~ Two tribes~ Richa~    Richard
#>  2 Survivor: ~      2 Herbert~ Austra~ Two tribes~ Tina ~    Tina
#>  3 Survivor: ~      3 Shaba N~ Kenya   Two tribes~ Ethan~    Ethan
#>  4 Survivor: ~      4 Nuku Hi~ Polyne~ Two tribes~ Vecep~    Vecepia
#>  5 Survivor: ~      5 Ko Taru~ Thaila~ Two tribes~ Brian~    Brian
#>  6 Survivor: ~      6 Rio Neg~ Brazil  Two tribes~ Jenna~    Jenna
#>  7 Survivor: ~      7 Pearl I~ Panama  Two tribes~ Sandr~    Sandra
#>  8 Survivor: ~      8 Pearl I~ Panama  Three trib~ Amber~    Amber
#>  9 Survivor: ~      9 Efate, ~ Vanuatu Two tribes~ Chris~    Chris
#> 10 Survivor: ~     10 Koror, ~ Palau   A schoolya~ Tom W~    Tom
#> # ... with 30 more rows, and 9 more variables: final_vote, timeslot,
#> #   premiered, premier_viewers, ended, finale_viewers, reunion_viewers,
#> #   rank, viewers

season_summary %>%
  select(season, viewers_premier, viewers_finale, viewers_reunion, viewers_mean) %>%
  pivot_longer(cols = -season, names_to = "episode", values_to = "viewers") %>%
  mutate(
    episode = to_title_case(str_replace(episode, "viewers_", ""))
  ) %>%
  ggplot(aes(x = season, y = viewers, colour = episode)) +
  geom_line() +
  geom_point(size = 2) +
  theme_minimal() +
  scale_colour_survivor(16) +
  labs(
    title = "Survivor viewers over the 40 seasons",
    x = "Season",
    y = "Viewers (Millions)",
    colour = "Episode"
  )
Season and demographic information about each castaway. Within a season the data is ordered from the first voted out to the sole survivor, as indicated by order, which represents the order in which castaways left the island. This may be by being voted off the island, being evacuated for medical reasons, or quitting. When demographic information is missing, it likely means that the castaway re-entered the game at a later stage by winning the opportunity to return. Castaways who have played in multiple seasons will feature more than once, with the age and location representing that point in time.
castaways %>%
  filter(season == 40)
#> # A tibble: 22 x 15
#>    season_name season castaway nickname   age city  state   day original_tribe
#>  1 Survivor: ~     40 Natalie~ Natalie                          2 Sele
#>  2 Survivor: ~     40 Amber M~ Amber       40 Pens~ Flor~      3 Dakal
#>  3 Survivor: ~     40 Danni B~ Danni       43 Shaw~ Kans~      6 Sele
#>  4 Survivor: ~     40 Ethan Z~ Ethan       45 Hill~ New ~      9 Sele
#>  5 Survivor: ~     40 Tyson A~ Tyson                          11 Dakal
#>  6 Survivor: ~     40 Rob Mar~ Rob         43 Pens~ Flor~     14 Sele
#>  7 Survivor: ~     40 Parvati~ Parvati     36 Los ~ Cali~     16 Sele
#>  8 Survivor: ~     40 Sandra ~ Sandra      44 Rive~ Flor~     16 Dakal
#>  9 Survivor: ~     40 Yul Kwon Yul         44 Los ~ Cali~     18 Dakal
#> 10 Survivor: ~     40 Wendell~ Wendell     35 Phil~ Penn~     21 Dakal
#> # ... with 12 more rows, and 6 more variables: merged_tribe, result,
#> #   jury_status, order, swapped_tribe, swapped_tribe2
This data frame contains a complete history of votes cast across all seasons of Survivor, allowing you to see who voted for whom at each tribal council. It also includes details on who had individual immunity, as well as who had their votes nullified by a hidden immunity idol. This data frame details the key events of each season.
While there are consistent events across the seasons such as the tribe swap, there are some unique events such as the ‘mutiny’ in Survivor: Cook Islands (Season 13) or the ‘Outcasts’ in Survivor: Pearl Islands (season 7). When castaways change tribes by some means other than a tribe swap, it is still recorded as ‘swapped’ to maintain a standard.
The data is recorded as ‘swapped’ with a trailing digit if a swap has occurred more than once. This includes absorbed tribes, e.g. when 3 tribes are reduced to 2, or when Stephanie was ‘absorbed’ in Survivor: Palau (season 10) after everyone but her was voted off the tribe (making Palau one of the classic seasons of Survivor). To indicate a change in tribe status, these events are also considered ‘swapped’.
This data frame is at the tribal council by castaway grain, so there is a vote for everyone that attended the tribal council. However, there are some edge cases such as when the ‘steal a vote’ advantage is played. In this case, there is a second row for the castaway indicating their second vote.
In the case of a tie and a revote, the first vote is recorded and the result is recorded as ‘Tie’. The deciding vote is recorded as normal. Where there is a double tie, it is recorded as ‘Tie2’ (for lack of a better name). In the case of a double tie and it goes to rocks, the vote is either ‘Black rock’ or ‘White rock’. In the older episodes of Survivor, when there were two ties in a row, rather than going to rocks there was a countback of votes.
vh <- vote_history %>%
  filter(
    season == 40,
    episode == 10
  )

vh
#> # A tibble: 9 x 11
#>   season_name season episode   day tribe_status castaway immunity vote
#> 1 Survivor: ~     40      10    25 merged       Tony     individ~ Tyson
#> 2 Survivor: ~     40      10    25 merged       Michele           Tyson
#> 3 Survivor: ~     40      10    25 merged       Sarah             Deni~
#> 4 Survivor: ~     40      10    25 merged       Sarah             Tyson
#> 5 Survivor: ~     40      10    25 merged       Ben               Tyson
#> 6 Survivor: ~     40      10    25 merged       Nick              Tyson
#> 7 Survivor: ~     40      10    25 merged       Kim               Soph~
#> 8 Survivor: ~     40      10    25 merged       Sophie            Deni~
#> 9 Survivor: ~     40      10    25 merged       Tyson             Soph~
#> # ... with 3 more variables: nullified, voted_out, order

vh %>%
  count(vote)
#> # A tibble: 5 x 2
#>   vote       n
#> 1 Denise     2
#> 2 Immune     1
#> 3 None       1
#> 4 Sophie     2
#> 5 Tyson      5
Events in the game such as fire challenges, rock draws, steal-a-vote advantages, or countbacks (in the early days) often mean a vote wasn’t placed for an individual; rather, a challenge may have been won or lost, or no vote cast, even though the castaway attended tribal council. These events are recorded in the vote field. I have included a function clean_votes for when only the votes cast for individuals are needed. If the input data frame has the vote column, it can simply be piped.
vh %>%
  clean_votes() %>%
  count(vote)
#> # A tibble: 3 x 2
#>   vote       n
#> 1 Denise     2
#> 2 Sophie     2
#> 3 Tyson      5
A nested tidy data frame of immunity challenge results. Each row in this dataset is a tribal council. It is a nested data frame since there may be multiple people or tribes that win immunity, particularly when there are 3 or more tribes in the first phase of the game. You can extract the immunity winners by expanding the data frame. There may be duplicates for the rare event when there are multiple eliminations after a single immunity challenge.
immunity %>%
  filter(season == 40) %>%
  unnest(immunity)
#> # A tibble: 23 x 8
#>    season_name       season episode title         voted_out   day order immunity
#>  1 Survivor: Winner~     40       1 Greatest of ~ Natalie       2     1 Dakal
#>  2 Survivor: Winner~     40       1 Greatest of ~ Amber         3     2 Sele
#>  3 Survivor: Winner~     40       2 It's Like a ~ Danni         6     3 Dakal
#>  4 Survivor: Winner~     40       3 Out for Blood Ethan         9     4 Dakal
#>  5 Survivor: Winner~     40       4 I Like Reven~ Tyson        11     5 Sele
#>  6 Survivor: Winner~     40       5 The Buddy Sy~ Rob          14     6 Sele
#>  7 Survivor: Winner~     40       5 The Buddy Sy~ Rob          14     6 Dakal
#>  8 Survivor: Winner~     40       6 Quick on the~ Parvati      16     7 Yara
#>  9 Survivor: Winner~     40       6 Quick on the~ Sandra       16     8 Yara
#> 10 Survivor: Winner~     40       7 We're in the~ Yul          18     9 Yara
#> # ... with 13 more rows
A nested tidy data frame of reward challenge results, where each row is a reward challenge. Typically after the merge, if a single person wins a reward they are allowed to bring others along with them. The first castaway in the expanded list is the winner; subsequent players are those the winner brought along to the reward, although not always. Occasionally after the merge, the castaways are split into two teams for the purpose of the reward, in which case all castaways on the winning team win the reward rather than a single person. If reward is missing, there was no reward challenge for the episode.
rewards %>%
  filter(season == 40) %>%
  unnest(reward)
#> # A tibble: 29 x 6
#>    season_name            season episode title                        day reward
#>  1 Survivor: Winners at ~     40       1 Greatest of the Greats         2 Dakal
#>  2 Survivor: Winners at ~     40       1 Greatest of the Greats         3
#>  3 Survivor: Winners at ~     40       2 It's Like a Survivor Econ~     6 Dakal
#>  4 Survivor: Winners at ~     40       3 Out for Blood                  9 Dakal
#>  5 Survivor: Winners at ~     40       4 I Like Revenge                11 Sele
#>  6 Survivor: Winners at ~     40       5 The Buddy System on Stero~    14
#>  7 Survivor: Winners at ~     40       6 Quick on the Draw             16 Yara
#>  8 Survivor: Winners at ~     40       7 We're in the Majors           18 Yara
#>  9 Survivor: Winners at ~     40       7 We're in the Majors           18 Sele
#> 10 Survivor: Winners at ~     40       8 This is Where the Battle ~    21 Tyson
#> # ... with 19 more rows
This data frame contains the history of jury votes. It is more verbose than it needs to be. However, having a 0-1 column indicating if a vote was placed for the finalist makes it easier to summarise castaways that received no votes.
jury_votes %>%
  filter(season == 40)
#> # A tibble: 48 x 5
#>    season_name              season castaway finalist  vote
#>  1 Survivor: Winners at War     40 Sarah    Michele      0
#>  2 Survivor: Winners at War     40 Sarah    Natalie      0
#>  3 Survivor: Winners at War     40 Sarah    Tony         1
#>  4 Survivor: Winners at War     40 Ben      Michele      0
#>  5 Survivor: Winners at War     40 Ben      Natalie      0
#>  6 Survivor: Winners at War     40 Ben      Tony         1
#>  7 Survivor: Winners at War     40 Denise   Michele      0
#>  8 Survivor: Winners at War     40 Denise   Natalie      0
#>  9 Survivor: Winners at War     40 Denise   Tony         1
#> 10 Survivor: Winners at War     40 Nick     Michele      0
#> # ... with 38 more rows

jury_votes %>%
  filter(season == 40) %>%
  group_by(finalist) %>%
  summarise(votes = sum(vote))
#> # A tibble: 3 x 2
#>   finalist votes
#> 1 Michele      0
#> 2 Natalie      4
#> 3 Tony        12
A data frame containing the viewer information for every episode across all seasons. It also includes the rating and viewer share information for viewers aged 18 to 49 years.
viewers %>%
  filter(season == 40)
#> # A tibble: 14 x 9
#>    season_name season episode_number_~ episode title episode_date viewers
#>  1 Survivor: ~     40              583       1 Grea~ 2020-02-12      6.68
#>  2 Survivor: ~     40              584       2 It's~ 2020-02-19      7.16
#>  3 Survivor: ~     40              585       3 Out ~ 2020-02-26      7.14
#>  4 Survivor: ~     40              586       4 I Li~ 2020-03-04      7.08
#>  5 Survivor: ~     40              587       5 The ~ 2020-03-11      6.91
#>  6 Survivor: ~     40              588       6 Quic~ 2020-03-18      7.83
#>  7 Survivor: ~     40              589       7 We'r~ 2020-03-25      8.18
#>  8 Survivor: ~     40              590       8 This~ 2020-04-01      8.23
#>  9 Survivor: ~     40              591       9 War ~ 2020-04-08      7.85
#> 10 Survivor: ~     40              592      10 The ~ 2020-04-15      8.14
#> 11 Survivor: ~     40              593      11 This~ 2020-04-22      8.16
#> 12 Survivor: ~     40              594      12 Frie~ 2020-04-29      8.08
#> 13 Survivor: ~     40              595      13 The ~ 2020-05-06      7.57
#> 14 Survivor: ~     40              596      14 It A~ 2020-05-13      7.94
#> # ... with 2 more variables: rating_18_49, share_18_49
This data frame contains the tribe names and colours for each season, including the RGB values. These colours can be joined with the other data frames to customise colours for plots. Another option is to add tribal colours to ggplots with the scale functions.
tribe_colours
#> # A tibble: 139 x 7
#>    season_name                  season tribe_name     r     g     b tribe_colour
#>  1 Survivor: Winners at War         40 Sele           0   103   214 #0067D6
#>  2 Survivor: Winners at War         40 Dakal        216    14    14 #D80E0E
#>  3 Survivor: Winners at War         40 Yara           4   148    81 #049451
#>  4 Survivor: Winners at War         40 Koru           0     0     0 #000000
#>  5 Survivor: Island of the Ido~     39 Lairo        243   148    66 #F39442
#>  6 Survivor: Island of the Ido~     39 Vokai        217   156   211 #D99CD3
#>  7 Survivor: Island of the Ido~     39 Lumuwaku      48    78   210 #304ED2
#>  8 Survivor: Edge of Extinction     38 Manu          16    80   186 #1050BA
#>  9 Survivor: Edge of Extinction     38 Lesu           0   148   128 #009480
#> 10 Survivor: Edge of Extinction     38 Kama         250   207    34 #FACF22
#> # ... with 129 more rows
Included are ggplot2 scale functions (of the form scale_*_survivor()) to add tribe colours to ggplot. Simply input the desired season number to use those tribe colours. If the fill or colour aesthetic is the tribe name, this needs to be passed to the scale function as scale_fill_survivor(…, tribe = tribe) (for now), where tribe is on the input data frame. If the fill or colour aesthetic is independent of the actual tribe names, tribe does not need to be specified, and the tribe colours will simply be used as a colour palette, as in the viewers line graph above, which used the Micronesia colour palette.
ssn <- 35

labels <- castaways %>%
  filter(
    season == ssn,
    str_detect(result, "Sole|unner")
  ) %>%
  select(nickname, original_tribe) %>%
  mutate(label = glue("{nickname} ({original_tribe})")) %>%
  select(label, nickname)

jury_votes %>%
  filter(season == ssn) %>%
  left_join(
    castaways %>%
      filter(season == ssn) %>%
      select(nickname, original_tribe),
    by = c("castaway" = "nickname")
  ) %>%
  group_by(finalist, original_tribe) %>%
  summarise(votes = sum(vote)) %>%
  left_join(labels, by = c("finalist" = "nickname")) %>%
  {
    ggplot(., aes(x = label, y = votes, fill = original_tribe)) +
      geom_bar(stat = "identity", width = 0.5) +
      scale_fill_survivor(ssn, tribe = .$original_tribe) +
      theme_minimal() +
      labs(
        x = "Finalist (original tribe)",
        y = "Votes",
        fill = "Original\ntribe",
        title = "Votes received by each finalist"
      )
  }
These data provide a way to analyse each season, and the plays within each episode, more deeply. For example, using the vote history data we could construct a graph of who voted for whom, where each castaway is a node and each edge is a vote. In this representation it’s possible to use clustering algorithms to identify alliances in the data. Other uses include identifying the probability of players jumping ship and pivotal votes. This is particularly interesting for the first 1 or 2 tribal councils after the merge, to see whether players stick with their original tribe or jump ship.
ssn <- 40

df <- vote_history %>%
  filter(
    season == ssn,
    order == 13
  )

nodes <- df %>%
  distinct(castaway) %>%
  mutate(id = 1:n()) %>%
  rename(label = castaway)

edges <- df %>%
  count(castaway, vote) %>%
  left_join(
    nodes %>% rename(from = id),
    by = c("castaway" = "label")
  ) %>%
  left_join(
    nodes %>% rename(to = id),
    by = c("vote" = "label")
  ) %>%
  mutate(arrows = "to") %>%
  rename(value = n) %>%
  left_join(
    castaways %>%
      filter(season == ssn) %>%
      select(nickname, original_tribe),
    by = c("castaway" = "nickname")
  )

labels <- edges %>%
  select(from, to, castaway, original_tribe) %>%
  distinct(from, castaway, original_tribe) %>%
  arrange(castaway) %>%
  left_join(
    edges %>% count(vote),
    by = c("castaway" = "vote")
  )

cols <- tribe_colours$tribe_colour
names(cols) <- tribe_colours$tribe

ggraph(
  edges %>% rename(`Original tribe` = original_tribe),
  layout = "linear") +
  geom_edge_arc(aes(colour = `Original tribe`),
                arrow = arrow(length = unit(4, "mm"), type = "closed"),
                end_cap = circle(10, 'mm')) +
  geom_node_point(size = 26, colour = cols[labels$original_tribe]) +
  geom_node_point(size = 24, colour = "black") +
  geom_node_text(aes(label = labels$castaway), colour = "grey", size = 4,
                 vjust = 0, family = ft) +
  geom_node_text(aes(label = labels$n), colour = "grey", size = 4,
                 vjust = 2, family = ft) +
  scale_edge_colour_manual(values = cols[unique(edges$original_tribe)]) +
  scale_colour_manual(values = cols[unique(edges$original_tribe)]) +
  theme_graph()
I intend to update the survivoR package each week during the airing of future seasons. For Survivor and data nuts like myself, this will enable a deeper analysis of each episode, and provide neat ways to visualise the evolution of the game.
New features will be added, such as details on exiled castaways across the seasons. If you have a request for specific data, let me know in the comments and I’ll see what I can do. Also, if you’d like to contribute by adding to existing datasets or contributing a new dataset, please contact me directly via the contacts page.
Given the variable nature of the game of Survivor and how the rules are tweaked each season, there are bound to be edge cases where the data is not quite right. Please log an issue on Github, or with me directly in the comments and I will correct the datasets.
Data in the survivoR package was mostly sourced from Wikipedia. Other data, such as the tribe colours, was manually recorded and entered by me.
Torch graphic in hex: Fire Torch Vectors by Vecteezy
The post survivoR | Data from the TV series in R appeared first on Daniel Oehm | Gradient Descending.
The post Empirical Economics with R (Part B): Confounders, Proxies and Sources of Exogenous Variations first appeared on R-bloggers.
Chapter 3 of my course Empirical Economics with R mainly dives into the difficulties of estimating causal effects. (See the previous post for an overview of the earlier chapters dealing with prediction.)
One important point I emphasize in this chapter, with several small applications and simulations, is that for the estimation of causal effects it typically does not suffice to think only about possible confounders; we should also be confident that there is a compelling source of exogenous variation for our explanatory variable of interest. To be more concrete, consider the regression model
\[y = \beta_0 + \beta_1 x_1 + u\]
where $\beta_1$ shall measure a linear causal effect of $x_1$ on $y$ that we would like to estimate. The following graph shall illustrate the true data generating process:
The variable $x_2$ shall be a confounder that influences both $x_1$ and the dependent variable $y$. If we omit the confounder $x_2$ from our regression our OLS estimator $\hat \beta_1$ of the causal effect $\beta_1$ will be biased and might be far off the true causal effect. Typically, we won’t be able to exactly control for every confounder, but we may observe proxy variables that are correlated with the confounders. Adding a good proxy (high correlation with confounder) as control variable may substantially reduce the bias of our estimator.
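To see why both points matter, recall the textbook omitted variable bias formula for this setup (a standard result, written here in the notation above): when $x_2$ is omitted, the OLS estimator satisfies

\[ \text{plim} \ \hat \beta_1 = \beta_1 + \beta_2 \frac{Cov(x_1, x_2)}{Var(x_1)} \]

The smaller the exogenous variation in $x_1$, the larger the share of $Var(x_1)$ that is driven by the confounder $x_2$, and hence the larger the bias term.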
Let us look at a simulated data set:
sim.dat = function(n = 100000, beta0 = 0, beta1 = 1, beta2 = 1,
                   sd.proxy.noise = 1, sd.exo = 1) {
  x2 = rnorm(n, 0, 1)
  proxy = x2 + rnorm(n, 0, sd.proxy.noise)
  x1 = x2 + rnorm(n, 0, sd.exo)
  eps = rnorm(n, 0, 1)
  y = beta0 + beta1*x1 + beta2*x2 + eps
  data.frame(y, x1, x2, proxy)
}

set.seed(123)
dat = sim.dat(beta1 = 1, sd.proxy.noise = 0.05, sd.exo = 1)

library(stargazer)
stargazer(lm(y~x1, dat), lm(y~x1+x2, dat), lm(y~x1+proxy, dat),
          type = "html", keep.stat = c("rsq"))
Dependent variable: y

| | (1) | (2) | (3) |
|---|---|---|---|
| x1 | 1.502^{***} | 1.002^{***} | 1.005^{***} |
| | (0.003) | (0.003) | (0.003) |
| x2 | | 1.000^{***} | |
| | | (0.004) | |
| proxy | | | 0.994^{***} |
| | | | (0.004) |
| Constant | 0.001 | 0.001 | 0.0003 |
| | (0.004) | (0.003) | (0.003) |
| R^{2} | 0.750 | 0.833 | 0.832 |

Note: ^{*}p<0.1; ^{**}p<0.05; ^{***}p<0.01
While we estimate the causal effect $\beta_1 = 1$ with a large bias if we don’t add any control variable, controlling for proxy is in our simulation almost as good as directly controlling for the confounder x2. The reason is that our proxy is extremely good: it only has very small random noise (determined by sd.proxy.noise) and is thus highly correlated with the confounder:
cor(dat$x2, dat$proxy) ## [1] 0.9987416
The following simulation only reduces the exogenous variation in x1 by reducing the parameter sd.exo, i.e. now the largest source of variation of x1 is the confounder x2.
set.seed(123)
dat = sim.dat(beta1 = 1, sd.proxy.noise = 0.05, sd.exo = 0.025)

stargazer(lm(y~x1, dat), lm(y~x1+x2, dat), lm(y~x1+proxy, dat),
          type = "html", keep.stat = c("rsq"))
Dependent variable: y

| | (1) | (2) | (3) |
|---|---|---|---|
| x1 | 2.001^{***} | 1.079^{***} | 1.971^{***} |
| | (0.003) | (0.127) | (0.057) |
| x2 | | 0.922^{***} | |
| | | (0.127) | |
| proxy | | | 0.030 |
| | | | (0.056) |
| Constant | 0.001 | 0.001 | 0.001 |
| | (0.003) | (0.003) | (0.003) |
| R^{2} | 0.800 | 0.800 | 0.800 |

Note: ^{*}p<0.1; ^{**}p<0.05; ^{***}p<0.01
While our proxy is still great (cor(x2, proxy) = 0.9987), adding the proxy as a control variable now does almost nothing to reduce the bias of our OLS estimator $\hat \beta_1$. I think this simulation illustrates a very important insight:
For the estimation of causal effects, it typically does not suffice to control well for confounders; we also need a sufficiently strong source of exogenous variation for our explanatory variable of interest.
Many modern empirical economic articles that study non-experimental data are very aware of this point and carefully describe the identifying variation in the variable of interest. Yet, somehow I never really got this point in my own days as a student. Therefore, one focus of this chapter is to discuss the possible sources of variation of the explanatory variable of interest in several small applications. One conclusion is that many typical field data sets just don’t allow one to convincingly estimate certain causal effects, because there is no clear source of exogenous variation.
Note that thinking of exogenous variation is also important if you apply modern machine learning techniques adapted to the estimation of causal effects. I discussed this point in the context of post-double selection with lasso regressions here.
There are several other insights that chapter 3 points out:
If possible try to run a randomized experiment to generate exogenous variation in the explanatory variable of interest.
If we don’t have a consistent estimator $\hat \beta_1$ of a certain causal effect, confidence intervals can also be far away from the true causal effect.
Under relatively weak assumptions, $\hat \beta$ consistently estimates the so-called coefficients of the best linear predictor. Just by looking at a regression equation it is not clear what a stated $\beta_1$ shall measure: a particular causal effect or a coefficient of the best linear predictor.
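For reference, the best linear predictor coefficients have a standard population definition (a textbook result, not specific to the course materials):

\[ \beta^{BLP} = \arg\min_b \ E\left[ (y - x'b)^2 \right] = \left( E[x x'] \right)^{-1} E[x y] \]

OLS estimates these coefficients consistently whether or not the linear model is causally interpretable; the causal interpretation requires the additional exogeneity conditions discussed above.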
The chapter also illustrates how to estimate non-linear and heterogeneous treatment effects. My main interpretation of simple linear specifications without interaction terms is that their coefficients measure some average over likely more complex non-linear and heterogeneous effects. The potential outcomes framework, which takes heterogeneity much more seriously, will only be discussed later, in chapter 5.
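As a quick illustration of that interpretation (a simulated sketch, not the chapter's code): an interaction term recovers group-specific effects, while the specification without the interaction estimates roughly their average.

```r
# Sketch: the effect of x is 0.5 in group d = 0 and 1.5 in group d = 1.
set.seed(1)
n <- 5000
d <- rbinom(n, 1, 0.5)                 # group indicator
x <- rnorm(n)                          # explanatory variable of interest
y <- (0.5 + 1 * d) * x + rnorm(n)      # heterogeneous effect of x

coef(lm(y ~ x))["x"]   # roughly 1: an average over the two groups
coef(lm(y ~ x * d))    # recovers 0.5 (x) and the +1 interaction (x:d)
```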
You can find all the material in the course's GitHub repository. Take a look at the setup instructions if you want to solve the RTutor problem sets on your own computer.
Some sneakily cool features made it into the JuliaCall v0.17.2 CRAN release. The latest version now includes an install_julia function for automatically installing Julia, which makes Julia a great high-performance back end for R packages. For example, the following code from the diffeqr package will work even without Julia installed:
```r
install.packages("diffeqr")
library(diffeqr)
de <- diffeqr::diffeq_setup()
lorenz <- function(u, p, t) {
  du1 = p[1] * (u[2] - u[1])
  du2 = u[1] * (p[2] - u[3]) - u[2]
  du3 = u[1] * u[2] - p[3] * u[3]
  c(du1, du2, du3)
}
u0 <- c(1.0, 1.0, 1.0)
tspan <- c(0.0, 100.0)
p <- c(10.0, 28.0, 8/3)
prob <- de$ODEProblem(lorenz, u0, tspan, p)
fastprob <- diffeqr::jitoptimize_ode(de, prob)
sol <- de$solve(fastprob, de$Tsit5(), saveat = 0.01)
```
Under the hood it's using the DifferentialEquations.jl package and the SciML stack, but this is abstracted away from users so thoroughly that Julia is essentially an alternative to Rcpp with easier interactive development. The following example really brings the seamless integration home:
```r
install.packages("diffeqr")
library(diffeqr)
de <- diffeqr::diffeq_setup()
degpu <- diffeqr::diffeqgpu_setup()
lorenz <- function(u, p, t) {
  du1 = p[1] * (u[2] - u[1])
  du2 = u[1] * (p[2] - u[3]) - u[2]
  du3 = u[1] * u[2] - p[3] * u[3]
  c(du1, du2, du3)
}
u0 <- c(1.0, 1.0, 1.0)
tspan <- c(0.0, 100.0)
p <- c(10.0, 28.0, 8/3)
prob <- de$ODEProblem(lorenz, u0, tspan, p)
fastprob <- diffeqr::jitoptimize_ode(de, prob)
prob_func <- function(prob, i, rep) {
  de$remake(prob, u0 = runif(3) * u0, p = runif(3) * p)
}
ensembleprob <- de$EnsembleProblem(fastprob, prob_func = prob_func,
                                   safetycopy = FALSE)
sol <- de$solve(ensembleprob, de$Tsit5(), degpu$EnsembleGPUArray(),
                trajectories = 10000, saveat = 0.01)
```
This example does the following:

- sets up diffeqr and the GPU ensemble back end via diffeq_setup() and diffeqgpu_setup()
- defines the Lorenz system in R and JIT-optimizes it into a fast Julia function with jitoptimize_ode()
- builds an EnsembleProblem whose prob_func randomizes the initial condition and parameters for every trajectory
- solves 10,000 trajectories on the GPU with EnsembleGPUArray()
What complicated code! Well, maybe it would shock you to know that the source code for the diffeqr package is only 150 lines of code. Of course, it's powered by a lot of Julia magic on the back end, and your next package can be too. For more details, see the big long post about differential equation solving in R with Julia.
The post JuliaCall Update: Automated Julia Installation for R Packages appeared first on Stochastic Lifestyle.
The post Introducing shiny.semantic::grid() – Build Your Shiny Dashboard Layout in Seconds first appeared on R-bloggers.

The main job of a data scientist is to provide meaningful insights, mostly through data visualizations and dashboards. More often than not, data science professionals struggle with HTML and CSS, which makes building an aesthetically pleasing layout nearly impossible. And they are not to blame – that's a job for frontend developers. Still, there's a simple solution available.
Want to learn Shiny from the ground up? Register for our free 90-minute workshop on the 20th of January, 2021.
At Appsilon, we've developed the shiny.semantic package, which comes with predefined components for your dashboard. One of these components is the grid, which enables you to build any layout in seconds, not hours or days.
Note: the presented functionality is available from shiny.semantic 0.4.2 onwards.
CSS Grid is a modern approach to complex, responsive web layouts. Appsilon's shiny.semantic package gives you two grid functions:

- grid_template() – for defining how your layout should be displayed on desktop and mobile
- grid() – for rendering a layout template in your Shiny UI

The key to the grid is grid template areas: in raw CSS, you name regions of the page and arrange those names into rows to describe the layout. (The original post shows the CSS snippet and the resulting layout as images.)
Let’s explore how to build a simple grid-powered layout next.
The shiny.semantic package simplifies this low-level CSS and gives you a convenient grid_template() function. You provide the desired desktop and mobile layouts inside it. The area names will be used later as keys in the grid() function when providing area content (inner HTML).
And here is how you render it in Shiny:
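The original post shows this code as an image; a minimal sketch of the pattern (the area names, sizes, and content below are illustrative, not the post's exact example):

```r
library(shiny)
library(shiny.semantic)

# Illustrative template: header on top, menu next to the main content
# on desktop; everything stacked in a single column on mobile.
my_grid <- grid_template(
  default = list(
    areas = rbind(
      c("header", "header"),
      c("menu",   "main")
    ),
    rows_height = c("50px", "auto"),
    cols_width  = c("200px", "1fr")
  ),
  mobile = list(
    areas = rbind("header", "menu", "main"),
    rows_height = c("50px", "50px", "auto"),
    cols_width  = c("100%")
  )
)

# Area names from the template become argument names in grid():
ui <- semanticPage(
  grid(my_grid,
       header = h1("My dashboard"),
       menu   = p("Menu goes here"),
       main   = p("Main content goes here"))
)

server <- function(input, output, session) {}

shinyApp(ui, server)
```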
For debugging your template, use the display_grid() function, which renders your current layout so you can inspect and try it out. (The result is shown as an image in the original post.)
Here is an example Shiny POC dashboard built in a couple of hours using the shiny.semantic package and a grid layout.
If you like the package and want more functionality like this, don't forget to star the repository.
Appsilon is hiring for remote roles! See our Careers page for all open positions, including R Shiny Developers, Fullstack Engineers, Frontend Engineers, a Senior Infrastructure Engineer, and a Community Manager. Join Appsilon and work on groundbreaking projects with the world’s most influential Fortune 500 companies.
Article Introducing shiny.semantic::grid() – Build Your Shiny Dashboard Layout in Seconds comes from Appsilon | End to End Data Science Solutions.