My package rAmCharts4 has had moderate success on GitHub (twenty stars), so I decided to present it here.
The post How to draw heatmap in r: Quick and Easy way appeared first on Data Science Tutorials.
How to draw a heatmap in R? A heatmap is essentially a table with colors replacing the numbers. The colors represent the measurement level, and they can help you locate highs, lows, and patterns.
In this article, we’ll show you how to create a clustered heatmap with the R programming language’s pheatmap package.

Let’s get started.
Raivo Kolde’s pheatmap package provides great control over heatmap proportions and appearance. The package’s main benefit is that it allows users to visually cluster heatmaps.
Software Package with Example Data and Heatmaps
Let’s start by creating some sample data.
```r
set.seed(123)                                   # make the random data reproducible
data <- matrix(round(rnorm(150), 2), nrow = 15)
colnames(data) <- LETTERS[1:10]
rownames(data) <- letters[1:15]
data
# 15 x 10 matrix of rounded standard-normal values,
# rows "a"–"o", columns "A"–"J"
```
Take a look at the table above. Our sample data is a matrix with 15 rows and ten numerical columns labeled “A,” “B,” “C,” “D,” “E,” “F,” “G,” “H,” “I,” and “J.”
To use the functions it provides, we must also install and load the pheatmap package in R.
```r
install.packages("pheatmap")
library("pheatmap")
```
Let’s use the pheatmap package to create some heatmaps!
The pheatmap function is used to draw a heatmap in the following code. Note that the only argument we need to supply is the name of the data matrix we wish to draw.
pheatmap(data)
Figure 1 shows the visual made by the previous R code – a heatmap created using the pheatmap package’s default settings.
The pheatmap function has a number of extra options that can be used to make the heatmap more appealing and show more clusters.
Example 2 explains how to partition our heatmap into a specific number of clusters (here, 4) using the k-means clustering method.
pheatmap(data, kmeans_k = 4)
Our data is represented in Figure 2 as a heatmap with four clusters. The cells in these clusters have been merged, as you can see.
In Example 3 we’ll show you how to partition our heatmap without merging the cells of the distinct groups (as was done in Example 2).
As shown below, the cutree_rows argument of the pheatmap function can be used for this.
pheatmap(data, cutree_rows = 4)
Our heatmap is divided into four categories by rows in Figure 3.
This example shows how to divide a heatmap into columns and rows (not only by rows as in Example 3).
In addition to the cutree_rows argument, we must also specify the cutree_cols argument.
pheatmap(data, cutree_rows = 4, cutree_cols = 3)
Our heatmap is clustered by rows and columns in Figure 4.
Cover photo by Eric Ward on Unsplash
This article is Part 3 of the series (Swarm) Portfolio Optimization. I assume you have read Part 2 – Notion of non-dominated solutions.
Here, I am going to take a step back and solve a single-objective optimization problem with a Particle Swarm Optimization (PSO) algorithm before moving on to the multi-objective optimization introduced in Part 2 – Notion of non-dominated solutions.
The proposed code is not an optimized one. I wanted to differentiate this article from other R-related ones by implementing the PSO algorithm with tibbles: saving all the population information at each iteration in this format can be useful if you want to, e.g., plot the dynamic movement of the swarm and perform further analyses.
This chapter is based on Maurice Clerc’s book “Particle Swarm Optimization” [1]. There are tons of articles and bibliography on PSO and its variants. For the sake of clarity, I will stick to a simple version of PSO, without informants, based on the algorithm described in the next chapter.
We are interested only in continuous problems with real variables. A search space is defined, classically, as a (hyper)cube of the form [XMIN, XMAX]^D.
Initialization simply consists of randomly placing the particles in this search space according to a uniform distribution.
The particles have velocities. By definition, a velocity is a vector or, more precisely, an operator which, applied to a position, gives another position. It is in fact a displacement, called velocity because the time increment between iterations is always implicitly regarded as equal to 1.
For the moment, let us be satisfied with drawing the components of each velocity at random, according to a uniform distribution in [(XMIN − XMAX)/2, (XMAX − XMIN)/2].
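As a small illustration of this initialization (using the XMIN/XMAX bounds that are defined later in the article, assumed here just for the example):

```r
# Hypothetical illustration: draw one particle's initial velocity, with
# D components uniformly distributed in [(XMIN - XMAX)/2, (XMAX - XMIN)/2].
set.seed(1)
D    <- 4
XMIN <- 0.05
XMAX <- 0.80
v0 <- runif(D, min = (XMIN - XMAX)/2, max = (XMAX - XMIN)/2)
# Every component lies within the symmetric velocity bounds
all(v0 >= (XMIN - XMAX)/2 & v0 <= (XMAX - XMIN)/2)  # TRUE
```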
The dimension of the search space is D. Therefore, the current position or location of a particle in this space at the moment t is given by a vector x(t), with D components.
Its current velocity is v(t). The best position or location found up to now by this particle is given by a vector p(t).
Lastly, the best position or location found by informants of the particle is indicated by a vector g(t).
In general, we will write simply x, v, p, and g.
The dth component of one of these vectors is indicated by the index d, for example x_d.
With these notations, the equations of motion of a particle are, for each dimension d:

v_d ← c1 · v_d + cmax · alea(0,1) · (p_d − x_d) + cmax · alea(0,1) · (g_d − x_d)
x_d ← x_d + v_d
c1 is a constant (confidence in its own movement) and it will be represented by the VELOCITY_FACTOR constant.
To use this model, the two parameters c1 and cmax (represented by the CMAX constant) must be defined. CMAX can be regarded as the maximum confidence granted by the particle to any performance transmitted by another. For each problem, the right values can be found only by experiment; VELOCITY_FACTOR and CMAX must not be chosen independently, and the following couples of values are recommended:

(VELOCITY_FACTOR = 0.7, CMAX = 1.47)
(VELOCITY_FACTOR = 0.8, CMAX = 1.62)
The first product cmax · alea(0,1) in the first equation of motion will be denoted C1 (confidence granted by the particle to its own performance), and the second one C2 (confidence granted by the particle to the best particle’s performance). Note: C2 is supposed to be the confidence granted by the particle to its best informant’s performance, but as stated earlier in this article we do not consider Clerc’s implementation with informants.
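Put together, a single particle’s update for one dimension can be sketched numerically (all values below are made up for illustration; only the structure matches the equations of motion):

```r
set.seed(42)
c1 <- 0.7; cmax <- 1.47   # the recommended (VELOCITY_FACTOR, CMAX) couple
x <- 0.30; v <- 0.05      # current position and velocity (one dimension)
p <- 0.35; g <- 0.40      # particle best and global best positions

# First equation of motion: C1 and C2 are cmax * alea(0,1)
v_new <- c1*v + cmax*runif(1)*(p - x) + cmax*runif(1)*(g - x)
# Second equation of motion: apply the displacement
x_new <- x + v_new
```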
In practice, it is not desirable for too many particles to leave the search space as early as the first increment. We supplement the confinement mechanism (which brings the particle back to the border of the search space) with a velocity modification. In this article we choose to cancel the offending velocity component; the complete mechanism thus consists of the following operations: if a location component falls below XMIN (or above XMAX), set the corresponding velocity component to zero and bring the location component back to XMIN (or XMAX, according to the case).
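The confinement-plus-velocity-cancellation mechanism can be sketched as a minimal stand-alone function (the name confine() and its vector arguments are mine, not part of the article’s final code, which applies the same logic to a whole matrix):

```r
# Sketch of the confinement mechanism: if a location component leaves
# [xmin, xmax], clamp it to the border and cancel that velocity component.
confine <- function(x, v, xmin, xmax) {
  below <- x < xmin
  above <- x > xmax
  v[below | above] <- 0   # velocity component cancellation
  x[below] <- xmin        # bring the particle back to the lower border
  x[above] <- xmax        # bring the particle back to the upper border
  list(x = x, v = v)
}

confine(x = c(0.90, 0.02, 0.30), v = c(0.10, -0.20, 0.05),
        xmin = 0.05, xmax = 0.80)
# $x is c(0.80, 0.05, 0.30); $v is c(0, 0, 0.05)
```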
This chapter is based on the book Metaheuristic and Evolutionary Algorithms for Engineering Optimization [2].
Generally speaking, the objective of the PSO algorithm is to minimize a function f(x) where x denotes a particle.
We start with an initial population.
For any particle, a position x is said to be better than a position y if f(x) < f(y).
Now the population is updated by the movement of particles around the search space.
Each particle i remembers the best position Pi attained by it in the past.
Also the swarm remembers the global best value G attained by any particle in the past.
Each particle’s movement is determined by its current velocity, its own best known position and the global best value of the swarm.
The swarm is expected to move towards the global best solution or at least towards a satisfactory solution for which the function value is close to the global minimum.
The PSO algorithm implemented in this article is the following:

1. Choose the population size N.
2. Randomly generate the initial location of each particle in the search space.
3. Normalize each location so that its components sum up to 1.
4. Randomly generate the initial velocity of each particle.
5. Choose the velocity factor w.
6. Initialize each particle's best location with its initial location.
7. Compute the fitness function value of each particle.
8. Find the global best particle (minimum fitness).
9. Compute the new velocities and locations of the particles.
10. Compute the fitness function value of the new locations.
11. Update each particle's best location if its new fitness is smaller.
12. Find the new global best particle, then repeat from step 9. until MAX_ITERATIONS is reached.
We will use the same 4asset portfolio (MSFT, AAPL, GOOG, NFLX) as in Part 2.
The goal is to find the optimal solution (or portfolio weights) for the single-objective optimization problem below. I have struck through the second objective function from Part 2, leaving only the minimization of the first objective function: the Value-at-Risk (VaR), relative to the mean, of the portfolio.
Load the same packages as in Part 2
```r
library(PerformanceAnalytics)
library(quantmod)
library(tidyquant)
library(tidyverse)
library(timetk)
library(skimr)   # not used in this article
library(plotly)  # not used in this article
library(caret)   # not used in this article
library(tictoc)
```
Algorithm steps 1. (population size as NUMBER_PARTICLES) and 5. (w as VELOCITY_FACTOR) will be defined along with the other PSO parameters.
```r
# PSO parameters
NUMBER_OF_ASSETS = 4   # D = 4 dimensions
NUMBER_PARTICLES = 40  # N = population size
VELOCITY_FACTOR  = .7  # confidence in its own movement
XMIN = .05             # used for mechanism of confinement, not as constraint
XMAX = .80             # used for mechanism of confinement, not as constraint
CMAX = 1.47            # maximum confidence granted by the particle to any
                       # performance transmitted by another
VMAX = (XMAX - XMIN)/2
VMIN = (XMIN - XMAX)/2
MAX_ITERATIONS = 50
```
As explained in the code snippet below, if you have already saved the 4 assets’ daily returns generated in Part 2’s chapter “Collect prices, compute daily returns and clean data” as returns_daily_tbl.rds, you can load it directly and apply the formatTibble() function to generate the returns tibble that will be used in this article.
If you haven’t saved the daily returns, run the code snippet in that chapter of Part 2 and don’t forget to apply the formatTibble() function.
```r
## Assets & Daily returns data ----
# Microsoft, Apple, Google, Netflix
assets <- c("MSFT", "AAPL", "GOOG", "NFLX")
returns_daily_tbl <- readRDS("<RDS file path>/returns_daily_tbl.rds")
returns <- returns_daily_tbl %>% formatTibble()
rm(returns_daily_tbl)
```
At each iteration the program will generate a tibble with the following format: each line is a particle representing a portfolio of the 4 assets. Since there are 4 assets there will be 4 components for the location, velocity and best location of each particle.
Columns x_1, x_2, x_3, x_4 are the location (position) components of the particle.
Column pRelVaR is the “relative to the mean” VaR (see Part 1) calculated with those locations (see the computeRelVaR() function below).
Columns v_1, v_2, v_3, v_4 are the velocity components of the particle.
Columns p_1, p_2, p_3, p_4 are the best location components of the particle.
The initializePopulation() function applies algorithm steps 2., 3., 4., 6., and 7., and it will be run once. You can save the resulting tibble (with saveRDS() for instance) in order to rerun the program several times using the same initial population.
The initializePopulation() function calls the following three functions:

generateRandomLocations(): for steps 2. and 3.
generateRandomVelocities(): for step 4.
computeRelVaR(): for step 7.; it needs the daily returns tibble.

Step 6. does not need a specific function; all best locations p_ will be initialized with the initial location x_ values.
Here’s the reference initial population tibble used in this article and for other optimization scenarios, e.g., changing the parameter couple (VELOCITY_FACTOR, CMAX) and/or the population size NUMBER_PARTICLES.
```r
init_pop <- initializePopulation(returns = returns)
init_pop
# A tibble: 40 x 13 with columns x_1..x_4, pRelVaR, v_1..v_4, p_1..p_4
# (one row per particle; the best locations p_ equal the initial locations x_)

initializePopulation <- function(returns){
  ## Randomly generate locations of particles ----
  init_locations <- lapply(X = 1:NUMBER_PARTICLES, function(x){
    generateRandomLocations(n = NUMBER_OF_ASSETS, min = XMIN, max = XMAX)
  })
  init_locations <- init_locations %>% bind_rows()
  ## Randomly generate velocities of particles ----
  init_velocities <- lapply(X = 1:NUMBER_PARTICLES, function(x){
    generateRandomVelocities(n = NUMBER_OF_ASSETS, min = VMIN, max = VMAX)
  })
  init_velocities <- init_velocities %>% bind_rows()
  ## Initialize best locations as first locations ----
  init_bestlocations <- init_locations
  colnames(init_bestlocations) <- sapply(X = 1:NUMBER_OF_ASSETS,
                                         function(x) paste0("p_", x))
  ## Compute fitness function value, here pRelVaR, for each particle ----
  init_particles <- bind_cols(computeRelVaR(tbl = init_locations, returns = returns),
                              init_velocities)
  init_particles <- bind_cols(init_particles, init_bestlocations)
  return(init_particles)
}
```
The function returns a tibble with n lines, each composed of the x_1, x_2, x_3, x_4 location components randomly generated with the runif() function. The min and max argument default values will be overwritten with the XMIN and XMAX values.
The locations must sum up to 1 (100%) since they represent the weights of the assets.
```r
generateRandomLocations <- function(n, min = .05, max = .75){
  tmp <- runif(n = n, min = min, max = max)
  # Weights must sum up to 1 (100%)
  nweights <- tmp/sum(tmp)
  new_cols <- list()
  for(x in 1:n) {
    new_cols[[paste0("x_", x)]] = nweights[x]
  }
  res <- as_tibble(new_cols)
  return(res)
}
```
The function returns a tibble with n lines, each composed of the v_1, v_2, v_3, v_4 velocity components randomly generated with the runif() function. The min and max argument default values will be overwritten with the VMIN and VMAX values.
```r
generateRandomVelocities <- function(n, min = -1, max = 1){
  velocities <- runif(n = n, min = min, max = max)
  new_cols <- list()
  for(x in 1:n) {
    new_cols[[paste0("v_", x)]] = velocities[x]
  }
  res <- as_tibble(new_cols)
  return(res)
}
```
The computeRelVaR() function returns a tibble with NUMBER_PARTICLES lines, each composed of the x_1, x_2, x_3, x_4 locations and the corresponding “relative to the mean” VaR result.
The computeRelVaR() function calls the following three functions:

getWeigthedReturns(): as defined in Part 2
getMeanWeigthedReturns(): as defined in Part 2
getVaRValues(): a simplified version of Part 2's getObjectiveFunctionsValues() function; it computes and returns VaR-related values by calling only objectiveFunction1 as defined in Part 2.
```r
computeRelVaR <- function(tbl, returns){
  lrelVaRValues <- lapply(1:nrow(tbl), function(x){
    weights <- unlist(tbl[x, c(1:NUMBER_OF_ASSETS)])
    returns %>%
      getWeigthedReturns(vecA = assets, vecW = weights) %>%
      getMeanWeigthedReturns() %>%
      getVaRValues(vecW = weights)
  })
  relVaRValues <- lrelVaRValues %>%
    bind_rows() %>%
    select(starts_with("x_"), pRelVaR)
  return(relVaRValues)
}
```
The following code snippet implements algorithm step 8.
It simply finds the particle within the initial population that has the minimum "relative to the mean" VaR.
```r
g_particle <- init_pop[which(init_pop$pRelVaR == min(init_pop$pRelVaR)), ]
```
The code snippet below implements all remaining steps, from 9. to 12.
At each iteration we will save the corresponding population tibble in the iterations_data list. Its initialization is shown in the code snippet below.
```r
iterations_data <- vector("list", MAX_ITERATIONS)
iterations_data[[1]] <- list(pop = init_pop, gb = g_particle)
```
The main program calls the computeNextLocation() function. This function computes the next location components x_1, x_2, x_3, x_4 based on the velocities v_1, v_2, v_3, v_4, which must be calculated by getVelocityMatrix() from the current best locations p_1, p_2, p_3, p_4 and the global best location.
The new locations tibble will be used to calculate the “relative to the mean” VaR (pRelVaR) with the computeRelVaR() function, already described above.
Next, we must identify whether there are any particles with a smaller pRelVaR than in the last iteration. If yes, then we must update their best position components p_1, p_2, p_3, p_4 with their new location components x_1, x_2, x_3, x_4, and “rebuild” the population tibble. Otherwise, no best positions are updated and the same population tibble will be used in the next iteration, hoping that the randomness introduced in getVelocityMatrix() will generate new locations with smaller pRelVaR.
In either case, we find the global best particle with the minimum pRelVaR and save it (g_particle). Then, we add the population tibble and global best particle to iterations_data.
```r
# Iterations
iter <- 2
while(iter <= MAX_ITERATIONS){
  # Compute next velocities and locations
  next_locvels <- computeNextLocation(tbl = iterations_data[[iter-1]]$pop,
                                      gbest = iterations_data[[iter-1]]$gb)
  # Get velocities tibble
  next_vels <- next_locvels$vels
  # Get new locations tibble
  next_fitnessval <- computeRelVaR(tbl = next_locvels$locs, returns = returns)
  # Get best locations tibble
  next_bestloc <- iterations_data[[iter-1]]$pop %>% select(starts_with("p_"))
  # Identify particles with smaller pRelVaR than last iteration pRelVaR
  update_idx <- which(next_fitnessval$pRelVaR < iterations_data[[iter-1]]$pop$pRelVaR)
  # For particles with smaller pRelVaR than last iteration pRelVaR
  if(length(update_idx) > 0){
    # Update particles' pbest locations (p) with new locations => update pbest tibble
    next_bestloc[update_idx, ] <-
      next_fitnessval[update_idx, -which(colnames(next_fitnessval) == "pRelVaR")]
    newRelVaRloc <- computeRelVaR(tbl = next_bestloc, returns = returns)
    new_pop <- bind_cols(bind_cols(newRelVaRloc, next_vels), next_bestloc)
  }else{
    # No updates: did not find particles with smaller pRelVaR than last iteration pRelVaR
    new_pop <- iterations_data[[iter-1]]$pop
  }
  # Find global best particle with minimum pRelVaR
  g_particle <- new_pop[which(new_pop$pRelVaR == min(new_pop$pRelVaR)), ]
  iterations_data[[iter]] <- list(pop = new_pop, gb = g_particle)
  iter <- iter + 1
}
# Find the particle with solution x_1, x_2, x_3, x_4 with minimum pRelVaR
iterations_data[[MAX_ITERATIONS]]$gb
```
The computeNextLocation() function starts by calling getVelocityMatrix(), described later, which returns the particles’ velocities as a matrix.
As stated by the second equation of motion, we must add the velocity matrix to the current locations, for which the tibble has been converted to a matrix.
Next, we must verify that the newly calculated location values stay within the XMIN and XMAX bounds. If not, we set the corresponding velocity coordinate to zero and bring the location coordinate back to its limit, XMIN or XMAX according to the case (new location coordinate < XMIN or > XMAX).
Remember that locations represent the weights of the assets, so they must sum up to 1 (100%). That is why the simple asDistribution() function is called. This is where I introduce some flexibility into the algorithm, since after calling this function you may encounter a location coordinate x_ which is actually smaller than XMIN. See the “red” weight values in the Excel sheet run results shown at the end of the article.
Both updated locations and velocities matrices will be converted back to tibbles before being returned within a list.
```r
asDistribution <- function(vec){ vec/sum(vec) }

computeNextLocation <- function(tbl, gbest){
  velmat <- getVelocityMatrix(tbl = tbl, gbest = gbest)
  current_mat <- as.matrix(tbl %>% select(starts_with("x_")))
  res <- current_mat + velmat
  # Verify that weights are within min and max bounds
  if (length(which(res < XMIN)) > 0) {
    velmat[which(res < XMIN)] = 0L
    res[which(res < XMIN)] = XMIN
  }
  if (length(which(res > XMAX)) > 0) {
    velmat[which(res > XMAX)] = 0L
    res[which(res > XMAX)] = XMAX
  }
  # Locations (assets weights) must sum up to 1
  lnorm <- lapply(X = 1:nrow(tbl), FUN = function(x) asDistribution(res[x, ]))
  # Convert to tibble
  locs <- lnorm %>% bind_rows() %>% as_tibble()
  vels <- as_tibble(velmat)
  colnames(vels) <- sapply(X = 1:NUMBER_OF_ASSETS, FUN = function(x) paste0("v_", x))
  # Return tibbles within a list
  return(list(locs = locs, vels = vels))
}
```
The getVelocityMatrix() function is where the famous equations of motion (explained above) take place and where randomness is introduced by calculating the C1 and C2 parameters.
getVelocityMatrix() breaks the population tibble into its main components (locations, best locations, and velocities), also as tibbles, and extracts the best location of the current global best particle as a vector.
Then the random parameters C1 and C2 are calculated by multiplying the fixed parameter CMAX by runif(1, min = 0, max = 1).
Each particle’s velocity vector is calculated by applying the first equation of motion as R vector operations. The result is the vector velocities.
We convert the vector velocities into a matrix velmat with as many lines as particles and as many columns as assets.
Next, we must verify that the newly calculated velocity values stay within the VMIN and VMAX bounds. If not, we bring the velocity coordinate back to its limit, VMIN or VMAX, according to the case.
Finally, the function returns the velocities matrix.
```r
getVelocityMatrix <- function(tbl, gbest){
  current_tbl  <- tbl %>% select(starts_with("x_"))
  pbest_tbl    <- tbl %>% select(starts_with("p_"))
  velocity_tbl <- tbl %>% select(starts_with("v_"))
  g <- as.numeric(gbest %>% select(starts_with("p_")))
  C1 <- CMAX*runif(1, 0, 1)
  C2 <- CMAX*runif(1, 0, 1)
  velocities <- lapply(X = 1:nrow(tbl), FUN = function(x){
    y <- as.numeric(current_tbl[x, ])
    v <- as.numeric(velocity_tbl[x, ])
    p <- as.numeric(pbest_tbl[x, ])
    VELOCITY_FACTOR*v + C1*(p - y) + C2*(g - y)
  })
  # byrow = TRUE so each particle's velocity vector fills one row
  velmat <- matrix(data = as.numeric(unlist(velocities)),
                   nrow = nrow(tbl), ncol = NUMBER_OF_ASSETS, byrow = TRUE)
  # Setting column names to "x_" since it will be summed with a location matrix
  colnames(velmat) <- sapply(X = 1:NUMBER_OF_ASSETS, FUN = function(x) paste0("x_", x))
  if (length(which(velmat < VMIN)) > 0) velmat[which(velmat < VMIN)] = VMIN
  if (length(which(velmat > VMAX)) > 0) velmat[which(velmat > VMAX)] = VMAX
  return(velmat)
}
```
Using the same initial population tibble (init_pop) I have run 19 optimization scenarios with different population sizes (NUMBER_PARTICLES) and 2 couples of (CMAX, VELOCITY_FACTOR) values.
The minimum "relative to the mean" VaR was 1.064213 with the following portfolio solution:
MSFT = 0.434 (43.4%)
AAPL = 0.274 (27.4%)
GOOG = 0.0378 (3.78%)
NFLX = 0.255 (25.5%)
Below are the results and details of each of the 19 optimization scenarios, sorted by ascending "relative to the mean" VaR values.
The "red" values in the portfolios (SOLUTIONS column) below correspond to weights smaller than XMIN.
This article introduced the basics of the PSO algorithm with tibbles. The goal was to resolve a singleobjective optimization problem by minimizing the "relative to the mean" VaR, the objective function.
This version of the algorithm did not use the notion of a group of informants per particle (as introduced in [1]); thus, all particles "know" the global best location.
Although the location limits XMIN and XMAX were defined, we did not use them as optimization constraints. Since the location components (weights of assets) must sum up to 1 (100%), some flexibility was introduced when doing this computation.
In Part 4 we will come back to our multi-objective optimization problem introduced in Part 2 – Notion of non-dominated solutions.
[1] Clerc M., "Particle Swarm Optimization", Lavoisier, 2005
[2] Bozorg-Haddad O., Solgi M., Loaiciga H., "Metaheuristic and Evolutionary Algorithms for Engineering Optimization"
16 May 2022
A great way to learn how to analyze data, or to improve the skills you already have, is to watch experienced people do it. It’s handy if you have experienced colleagues around you, but what if your colleagues are too busy to coach you, you’re still trying to break into data analysis, or you’re one of the many analysts now working remotely? It can be tricky to find good examples of someone doing a live analysis on new data and to see how they approach it.
Luckily, thanks to a bunch of very generous people, I’m super excited about this new website I’ve JUST launched that’ll be a great help to many new and experienced analysts alike.
There’s a weekly data analysis challenge called Tidy Tuesday in which a new dataset is posted and everyone analyses it, sharing their results and R code. It’s a fun way to learn and improve your skills, especially because you can see how other people got their results.
A very experienced data scientist called David Robinson has recorded his Tidy Tuesday analyses and posted them on YouTube – well over 80 of them, each about an hour long! Watching his screencasts is a fantastic way to see how someone approaches an analysis when they see the data for the first time.
Perhaps showing even more dedication, Alex Cookson and Eric Fletcher have painstakingly gone through the recordings, timestamped and annotated them with a description of David’s activity and which packages and functions he’s used. Make no mistake, this is an incredible amount of work. They captured everything in this publicly available google sheet.
Based on my experience generating website content from a spreadsheet for Big Book of R, and wanting to try building a website with Quarto (the next generation of R Markdown), I put everything together to create www.RScreencasts.com. Now you can more easily navigate the screencasts, the individual timestamps, packages, and functions, all via a searchable web interface. Using Quarto was easier than I expected (I’ve never used blogdown or distill before), and thanks to a chat with Gavin Masterson (Twitter), who has tried it, I felt confident enough to give it a go.
A small hope of mine is that someone sees this and feels inspired to timestamp and annotate all of Julia Silge’s screencasts. If you do and can use the same format as the David Robinson timestamps, then let me know and we can update this site with her screencasts too – that’d be REALLY amazing.
I hope you find RScreencasts.com useful and as always, if you have any feedback I’m eager to hear it.
If you’ve found RScreencasts useful, you might also like this blogpost:
The post Launch of R Screencasts appeared first on Oscar Baruffa.
Data scientists concentrate on making sense of data through exploratory analysis, statistics, and models. Software developers apply a different set of knowledge with different tools. Although their focus is different, data science teams can benefit from adopting software development best practices. Version control, automated testing, and other dev skills help create reproducible, productionready code and tools.
Rachael Dempsey recently asked the Twitter community for suggestions on resources that data scientists can use to improve their software development skill set.
We received so many great recommendations that we wanted to summarize and share them here. This blog post walks through software development best practices that your team may want to adopt and where to find out more.
The areas discussed below are: project structure, automated testing, reproducible environments, and version control.
It’d be great if we could open an R script, click Run, and we have the output we need. However, code doesn’t always “just work.” As our projects get bigger and more complex, we need structure so that they are easy to manage and understand.
As Daniel Chen says in his Structuring Your Data Science Projects talk, a proper project structure gets us to the happy place where our code runs. Existing principles, templates, and tools can guide you:
A data scientist can test a function by inputting some values, seeing if the output is what you expect, modifying the code if there are issues, and rerunning to check the values again. However, this process leaves a lot of room for error. It’s easy to forget what changed and cause something else to break.
Automated testing offers a better option, such as with Python’s pytest package or R’s testthat package. Automated testing focuses on small, well-defined pieces of code. Data scientists write out explicit expectations. Tests are saved in a single location, making them easy to rerun. When a test fails, it’s clear where to look for the problem.
```python
import pytest
from foobar import foo, bar

def test_foo_values():
    assert foo(4) == 8
    assert foo(2.2) == 1.9

def test_bar_limits():
    assert bar(4, [1, 90], option=True) < 8

# If you want to test that bar() raises an exception when called with certain
# arguments, e.g. if bar() should raise an error when its argument is negative:
def test_bar_errors():
    with pytest.raises(ValueError):
        bar(4, [2, 10])  # test passes if bar raises a ValueError
```
Incorporating automated testing in a workflow exposes problems early and makes it easier to alter code.
Have you ever had a script that worked great, but now you can’t reproduce the results on a new laptop? This may happen because of changes in the operating system, package versions, or other factors. You have to spend time figuring out why the output is suddenly different.
A reproducible environment ensures workflows run as they did in the past. Your team controls everything needed to run a project in a reproducible environment. Each person running the code can expect the same behavior since everything is standardized.
Virtual environments for Python or the renv package for R are examples of the tools that can help a data science team reproduce their work. They record the version of loaded packages in a project and can reinstall the declared versions of those packages.
For example, say one of your projects uses dplyr::group_map(), introduced in dplyr v0.8.0. If a team member runs the code with an older version of dplyr, they will run into an error. The renv package captures the state of the environment in which the code was written and shares that environment with others so that they can run the code without issue. (Try out Josiah Parry’s example on GitHub.)
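As a sketch of the typical renv workflow (the function calls are from renv’s documented API; the dplyr install is just an example package):

```r
# Typical renv workflow inside an R project (run interactively):
renv::init()               # set up a project-local library and create renv.lock
install.packages("dplyr")  # packages now install into the project library
renv::snapshot()           # record the exact package versions in renv.lock
# A collaborator who clones the project then runs:
renv::restore()            # reinstall the versions recorded in renv.lock
```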
With reproducible environments, data scientists can collaborate on work and validate results regardless of when and where they are running code.
Data scientists produce many files. As these files evolve throughout a project, keeping track of the latest version becomes more challenging. If the team collaborates on the same file, someone may use an outdated version and have to spend time reconciling mismatched lines of code.
Version control with tools like Git and GitHub can alleviate these pains. Teams can manage asynchronous work and avoid conflict or confusion. It’s easy to track the evolution of files, find (and revert) changes, and resolve differences between versions.
For example, you can quickly find the change where mtcars$hp was used instead of mtcars$am. Using version control in data science projects makes collaboration and maintenance more manageable.
These are just a few areas that data science teams should concentrate on to improve their software development skill set. We hope that you find them helpful and are excited to learn more!
Interested in developing holistic workflows, improving debugging processes, and writing non-repetitive code? Register for What They Forgot to Teach You About R at rstudio::conf(2022), a two-day session led by Shannon McClintock Pileggi, Jenny Bryan, and David Aja. Learn more on the rstudio::conf workshop page.
Introduction
Before 2022
In 2022
Probabilities computation in R
First draw
Second draw
Third draw
Game limited to 3 draws
Game limited to 5 draws
Game limited to 100 draws
Game limited to the number of necessary draws
Final winning probabilities
There is a popular TV show broadcast in France and the French-speaking part of Belgium called “Koh-Lanta”.
In this show, several adventurers are dropped off on a desert island with almost no food or equipment (just a personal backpack with their clothes and a small portion of rice). They must learn to survive on the hostile island by building their hut, finding water and food, and so on.
Each season, adventurers are divided into teams (called tribes), and the teams compete against each other in games involving ability, strength, thinking and endurance. In half of the games, the winning tribe receives some food or survival equipment (something to fish with or to make a fire, for instance), and the losing tribe receives nothing. In the other half of the games, each member of the losing tribe votes for an adventurer, and the adventurer with the most votes leaves the show for good. The winning tribe goes back to its island with all of its members.
At some point during the show, the two competing tribes are grouped together into one single tribe, and it continues this time with each adventurer competing against each other (so they play individually). The winner of the show is the last one to “survive”.
The 2022 season started with 24 adventurers. Just before being grouped together, each tribe has to select an adventurer in the opposing tribe. The two selected adventurers become ambassadors of their tribe. The two ambassadors must then go to another island in order to choose the adventurer who is going to leave the show for good. The adventurer selected by the two ambassadors will not be part of the reunification of the two tribes, and their adventure stops there.
Of course, both ambassadors want to eliminate a member of the opposing tribe (to arrive at the reunification with the most allies). If the two ambassadors cannot agree, they have to play a game that will determine which of the 2 ambassadors must leave the show. This game is entirely based on luck. For the rest of the article, we call this game the ambassadors’ game.
Here were the rules of the ambassadors’ game before the 2022 season:
There are two identical urns, one in front of each ambassador. Each urn contains exactly one black ball and one white ball. Each ambassador has to draw one of the 2 balls from his urn. Both urns are of course closed, so no one sees which ball is picked (nor which one is not picked). The winner of the game (remember that the winner stays in the show, the loser has to leave for good) is the one who picks a white ball while the other ambassador draws a black ball. If both ambassadors draw the same ball (both black or both white), the balls are put back in the urns and the game starts over (with the exact same conditions) until the two ambassadors draw balls of different colors.
For each draw, there are thus four possible results:

1. Ambassador A draws a white ball and ambassador B draws a black ball (A wins).
2. Ambassador A draws a black ball and ambassador B draws a white ball (B wins).
3. Both ambassadors draw a white ball (tie).
4. Both ambassadors draw a black ball (tie).

Let's compute the probability of each result. Since the two draws are independent (the fact that ambassador A draws a white ball does not change the probability that ambassador B draws a white ball), we can multiply the probabilities to compute the joint probability of the two events.
With \(P_A\) (\(P_B\)) denoting the probability that the ambassador from tribe A (B) draws a white ball, and since each urn contains one white and one black ball (\(P_A = P_B = 1/2\)), we have:

\(P(\text{outcome 1}) = P_A \times (1 - P_B) = 0.25\)
\(P(\text{outcome 2}) = (1 - P_A) \times P_B = 0.25\)
\(P(\text{outcome 3}) = P_A \times P_B = 0.25\)
\(P(\text{outcome 4}) = (1 - P_A) \times (1 - P_B) = 0.25\)
The sum of the 4 probabilities gives 1 (i.e., 100%), which makes sense since it covers all possible outcomes.
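A quick numeric check of these four probabilities in R (a sketch; p_a and p_b stand for \(P_A\) and \(P_B\)):

```r
p_a <- 1 / 2  # probability that ambassador A draws a white ball
p_b <- 1 / 2  # probability that ambassador B draws a white ball

outcomes <- c(
  a_wins     = p_a * (1 - p_b),        # A white, B black
  b_wins     = (1 - p_a) * p_b,        # A black, B white
  both_white = p_a * p_b,              # tie
  both_black = (1 - p_a) * (1 - p_b)   # tie
)

outcomes      # each outcome has probability 0.25
sum(outcomes) # 1
```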
This means that, on the first draw, each ambassador has a probability of 25% to win the game (outcome 1 for ambassador A, outcome 2 for ambassador B).
Of course, since the game is repeated until there is a winner, the probabilities of outcomes 3 and 4 tend, in the long run, to zero. If this statement is not straightforward to you, think about it like this: if you played that game with a friend up to 100 times, what would be the probability that there is still no winner, meaning that you and your friend drew balls of the same color 100 times in a row? You can see that this is highly unlikely.
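To put a number on it (a quick sketch): with identical urns the probability of a tie on any given draw is 1/2 (outcomes 3 and 4 together), so the probability of 100 consecutive ties is:

```r
p_tie <- 1 / 2  # probability both ambassadors draw the same color
p_tie^100       # probability of 100 consecutive ties: about 7.9e-31
```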
In this context, since both urns are identical and the game is played indefinitely until there is a winner, ambassadors have exactly the same probability of winning that game. It is indeed a 50-50 chance for each of them.
Of course, I would not write an article about the ambassadors' game if it were this easy and straightforward.
The 2022 season differs from the previous ones in the sense that for each game, the losing tribe receives an additional punishment (called a curse in the show). As a consequence, this year, the two urns at the ambassadors’ game were not identical.
To give you some context, the red tribe won against the yellow tribe in the last game before the ambassadors' negotiation. The punishment for the yellow tribe was the following: the urn of the yellow ambassador contained one white ball and two black balls, while the urn of the red ambassador kept the usual composition of one white ball and one black ball.
This is indeed a punishment for the yellow tribe since the game is not fair anymore: the ambassador of the red tribe clearly has a higher chance of winning that game compared to the ambassador of the yellow tribe.
For the curious among you, here is how it happened:
You guessed it by now, the reason for writing this article is of course not to explain to you what actually happened—the news sites do it better and much faster than me. The reason is that I wanted to compute the chance of winning the game for each ambassador if they had not agreed on an adventurer to eliminate.^{2}
Moreover, I wanted to compute these probabilities in R through simulation, and, why not, reuse the code in case the organizers of Koh-Lanta decide to change the rules again in the future.
Even if you do not watch Koh-Lanta (because it is not broadcast in your country, or you do not like the show), it could be of interest to those of you who want to see how a simple probability problem can be solved in R, and how a for loop and a function can be used to answer the initial question.

For the remainder of this article, we denote:

- p_c (p_l): the probability that Colin (Louana) draws a white ball from his (her) urn;
- q_c (q_l): the probability that Colin (Louana) draws a black ball, i.e., q_c = 1 - p_c and q_l = 1 - p_l.
Based on the composition of the urns given by Denis Brogniart, we have:
```r
p_c <- 1 / 3     # probability that Colin draws a white ball
p_l <- 1 / 2     # probability that Louana draws a white ball
q_c <- (1 - p_c) # probability that Colin draws a black ball
q_l <- (1 - p_l) # probability that Louana draws a black ball
```
To start easy, let’s first compute the winning probabilities for each ambassador on the first draw only. Remember that to have a winner, balls must be of different colors.
Louana wins if and only if Colin draws a black ball and Louana draws a white ball:
```r
# Louana winning on first draw
l_win <- q_c * p_l
l_win
## [1] 0.3333333
```
Louana has a 33.33% chance of winning on the first draw.
On the other hand, Colin wins if and only if Louana draws a black ball and Colin draws a white ball:
```r
# Colin winning on first draw
c_win <- q_l * p_c
c_win
## [1] 0.1666667
```
Colin has a 16.67% chance of winning on the first draw.
You can already see that the game is in favour of Louana, as expected. Let’s see now how it evolves when playing several times.
To win exactly on the second draw, it must be a tie on the first draw.
```r
# Louana winning on second draw
tie <- (p_c * p_l) + (q_c * q_l)
l_win2 <- tie * l_win
l_win2
## [1] 0.1666667

# Colin winning on second draw
c_win2 <- tie * c_win
c_win2
## [1] 0.08333333
```
To win exactly on the third draw, it must be a tie on the first two draws.
```r
# Louana winning on third draw
l_win3 <- (tie^2) * l_win
l_win3
## [1] 0.08333333

# Colin winning on third draw
c_win3 <- (tie^2) * c_win
c_win3
## [1] 0.04166667
```
A pattern seems to emerge in the code.
We can already compute the probabilities of winning for each ambassador as if the game was limited to three draws, by summing the probabilities for each of the first three draws:
```r
# Louana winning on draw 1, 2 or 3
l_win_tot <- l_win + l_win2 + l_win3
l_win_tot
## [1] 0.5833333

# Colin winning on draw 1, 2 or 3
c_win_tot <- c_win + c_win2 + c_win3
c_win_tot
## [1] 0.2916667
```
Now if we compute the probabilities as if the game was limited to 5 draws and generalize the computation to see the pattern even more clearly, we have:
```r
# Louana winning
l_win_tot <- ((tie^0) * l_win) + ((tie^1) * l_win) + ((tie^2) * l_win) +
  ((tie^3) * l_win) + ((tie^4) * l_win)
l_win_tot
## [1] 0.6458333

# Colin winning
c_win_tot <- ((tie^0) * c_win) + ((tie^1) * c_win) + ((tie^2) * c_win) +
  ((tie^3) * c_win) + ((tie^4) * c_win)
c_win_tot
## [1] 0.3229167
```
We could continue like this for a long time, but for now let's compute it for up to 100 draws, using a for loop. Using a for loop is necessary here to avoid copy-pasting our computations a hundred times.
For ease of illustration, we compute only the probability of winning for Louana. We will show later on how the probability for Colin can easily be computed.
```r
n_draws <- 100   # number of draws
l_win_tot <- c() # set empty vector

for (i in 1:n_draws) {
  l_win_tot[i] <- ((tie^(i - 1)) * l_win)          # prob of Louana winning exactly on draw i
  print(paste0("Draw ", i, ": ", sum(l_win_tot)))  # cumulative prob of winning up to draw i
}
## [1] "Draw 1: 0.333333333333333"
## [1] "Draw 2: 0.5"
## [1] "Draw 3: 0.583333333333333"
## [1] "Draw 4: 0.625"
## [1] "Draw 5: 0.645833333333333"
## ...
## [1] "Draw 50: 0.666666666666666"
## [1] "Draw 51: 0.666666666666666"
## [1] "Draw 52: 0.666666666666667"
## [1] "Draw 53: 0.666666666666667"
## ...
## [1] "Draw 100: 0.666666666666667"
```
The probabilities found up to draws 1, 3 and 5 are consistent with what we found in the previous sections. Moreover, we see that from draw 52 onwards, the probability of Louana winning remains constant at 66.67%. This is referred to as the limit.
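This limit can also be obtained in closed form: the per-draw winning probabilities form a geometric series, so the total probability is \(\sum_{i \ge 0} \text{tie}^i \times \text{l\_win} = \text{l\_win} / (1 - \text{tie})\). A quick sketch verifying this (the values are recomputed here so the snippet is self-contained):

```r
l_win <- (1 - 1/3) * (1/2)              # Colin draws black, Louana draws white: 1/3
tie   <- (1/3) * (1/2) + (2/3) * (1/2)  # both draw the same color: 1/2

l_win / (1 - tie)  # closed-form limit: 2/3, i.e. 66.67%
```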
Without printing the results of the above for loop, we do not know how many draws are necessary to reach the limit.
Let’s now try to include the information about the number of necessary draws in order to avoid computing unnecessary draws:
```r
n_draws <- 9999  # initial number of draws, intentionally set to a high number
p_tie <- c()     # set empty vector for prob of ties
l_win_tot <- c() # set empty vector for prob of Louana winning

# find the number of necessary draws:
for (i in 1:n_draws) {
  p_tie[i] <- tie^i                     # prob of a tie on every draw up to draw i
  limit_ndraws <- sum(p_tie > 2.2e-16)  # number of necessary draws
}

# compute Louana's winning probabilities with the smallest number of necessary draws
for (i in 1:limit_ndraws) {
  l_win_tot[i] <- ((tie^(i - 1)) * l_win)          # prob of Louana winning exactly on draw i
  print(paste0("Draw ", i, ": ", sum(l_win_tot)))  # cumulative prob of winning
}
## [1] "Draw 1: 0.333333333333333"
## [1] "Draw 2: 0.5"
## [1] "Draw 3: 0.583333333333333"
## ...
## [1] "Draw 51: 0.666666666666666"
## [1] "Draw 52: 0.666666666666667"
```
Remember that the initial question was:
What is the probability of winning the game for each ambassador?
The probability that Louana wins the game can easily be extracted as follows:
```r
sum(l_win_tot)
## [1] 0.6666667
```
And since the game only stops when there is a winner, the sum of the winning probabilities for Louana and Colin must be equal to 1, so the probability that Colin wins the game is:
```r
c_win_tot <- 1 - sum(l_win_tot)
c_win_tot
## [1] 0.3333333
```
To summarize:

- Louana has a 66.67% chance of winning the ambassadors' game;
- Colin has a 33.33% chance of winning it.
To visualize the probabilities for each ambassador, we are still missing the probabilities of winning for Colin, so let's compute them first:
```r
c_win_tot <- c() # set empty vector for prob of Colin winning

# compute Colin's winning probabilities with the smallest number of necessary draws
for (i in 1:limit_ndraws) {
  c_win_tot[i] <- ((tie^(i - 1)) * c_win)          # prob of Colin winning exactly on draw i
  print(paste0("Draw ", i, ": ", sum(c_win_tot)))  # cumulative prob of winning
}
## [1] "Draw 1: 0.166666666666667"
## [1] "Draw 2: 0.25"
## [1] "Draw 3: 0.291666666666667"
## ...
## [1] "Draw 51: 0.333333333333333"
## [1] "Draw 52: 0.333333333333333"
```
We create a dataset with the probabilities for both ambassadors:
```r
dat <- data.frame(
  Draw = rep(1:limit_ndraws, 2),
  Probability = c(cumsum(l_win_tot), cumsum(c_win_tot)),
  Ambassador = c(rep("Louana", limit_ndraws), rep("Colin", limit_ndraws))
)
```
We can now visualize these probabilities up to the number of necessary draws:
```r
# load package
library(ggplot2)

# plot
ggplot(dat) +
  aes(x = Draw, y = Probability, colour = Ambassador) +
  geom_line(size = 2L) +
  labs(
    y = "Probability of winning",
    caption = "Source: Koh-Lanta 2022"
  ) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1)) +
  theme_minimal()
```
But I believe the most appropriate plot to answer the initial question is with a barplot:
```r
# create dataset
dat_barplot <- data.frame(
  Ambassador = c("Louana", "Colin"),
  Probability = c(sum(l_win_tot), sum(c_win_tot))
)

# plot
ggplot(data = dat_barplot, aes(x = Ambassador, y = Probability)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(
    y = "Probability of winning",
    caption = "Source: Koh-Lanta 2022"
  ) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1)) +
  theme_minimal()
```
Let’s try to implement this problem in a function to be able to reuse it with other initial probabilities.
With:

- p_a and p_b denoting, respectively, the probability that ambassador A and ambassador B draw a white ball,
- n_draws denoting the maximum number of draws that is allowed (default = 9999),

we have the following function:
```r
ambassadors_game <- function(p_a, p_b, n_draws = 9999) {
  q_a <- (1 - p_a)
  q_b <- (1 - p_b)
  a_win <- q_b * p_a
  b_win <- q_a * p_b
  tie <- (p_a * p_b) + (q_a * q_b)
  p_tie <- c()     # set empty vector for prob of ties
  a_win_tot <- c() # set empty vector for prob of A winning
  b_win_tot <- c() # set empty vector for prob of B winning

  # find the number of necessary draws:
  for (i in 1:n_draws) {
    p_tie[i] <- tie^i                     # prob of a tie on every draw up to draw i
    limit_ndraws <- sum(p_tie > 2.2e-16)  # number of necessary draws
  }

  # compute A and B winning probabilities with the smallest number of necessary draws:
  for (i in 1:limit_ndraws) {
    a_win_tot[i] <- ((tie^(i - 1)) * a_win) # prob of A winning exactly on draw i
    b_win_tot[i] <- ((tie^(i - 1)) * b_win) # prob of B winning exactly on draw i
  }

  # save P(A), P(B) and the number of necessary draws:
  res <- list(
    "p_a" = sum(a_win_tot),
    "p_b" = sum(b_win_tot),
    "ndraws" = limit_ndraws
  )

  # return results:
  return(res)
}
```
We test the function to see if it matches results found above.
First, if both ambassadors have identical urns as it was the case before the 2022 season:
```r
ambassadors_game(
  p_a = 1 / 2,
  p_b = 1 / 2
)
## $p_a
## [1] 0.5
##
## $p_b
## [1] 0.5
##
## $ndraws
## [1] 52
```
The game is indeed fair, with a 50% chance of winning for each ambassador.
Second, with the urns presented to Louana and Colin, but for the first draw only (setting arbitrarily that Louana is ambassador A and Colin is ambassador B):
```r
# first draw only
ambassadors_game(
  p_a = 1 / 2,
  p_b = 1 / 3,
  n_draws = 1
)
## $p_a
## [1] 0.3333333
##
## $p_b
## [1] 0.1666667
##
## $ndraws
## [1] 1
```
Third, still with the urns presented to Louana and Colin, but for a game limited to exactly 3 and 5 draws:
```r
# up to 3 draws
ambassadors_game(
  p_a = 1 / 2,
  p_b = 1 / 3,
  n_draws = 3
)
## $p_a
## [1] 0.5833333
##
## $p_b
## [1] 0.2916667
##
## $ndraws
## [1] 3

# up to 5 draws
ambassadors_game(
  p_a = 1 / 2,
  p_b = 1 / 3,
  n_draws = 5
)
## $p_a
## [1] 0.6458333
##
## $p_b
## [1] 0.3229167
##
## $ndraws
## [1] 5
```
And now, the final verification with the real situation of Louana and Colin in Koh-Lanta 2022:
```r
# Koh-Lanta 2022 situation
out <- ambassadors_game(
  p_a = 1 / 2,
  p_b = 1 / 3
)
out
## $p_a
## [1] 0.6666667
##
## $p_b
## [1] 0.3333333
##
## $ndraws
## [1] 52
```
All results match the ones presented above.
Note that the probabilities for each ambassador can be extracted as follows:
```r
# prob ambassador A
out$p_a
## [1] 0.6666667

# prob ambassador B
out$p_b
## [1] 0.3333333
```
The initial question, coming from the television show Koh-Lanta, was: "What is the probability of winning for each participant if they play the ambassadors' game?"
In this article, we have shown how to compute these probabilities. Moreover, we have illustrated how a simple probability problem can be generalized to suit many real-life situations, and how a for loop and a function can be used to implement a real-life situation in R.
Last but not least, I would like to focus on something Denis Brogniart (the well-known presenter of the show) said just after the two ambassadors came back to the island to announce their choice to the other adventurers. He mentioned that, due to the punishment inflicted on the yellow tribe, there was a difference in chances of 16% between Louana and Colin.
This comes naturally from the following two situations:
First situation: considering the probability of drawing a white ball on a single draw, Louana has a 50% chance while Colin has a 33.33% chance, a difference of 50% - 33.33% = 16.67%.

Second situation: if the game is limited to the first draw only, Louana has a 33.33% chance of winning while Colin has a 16.67% chance, a difference of 33.33% - 16.67% = 16.66%.
It is true that, seeing the problem from those two angles, the difference is ~16%. Denis Brogniart is right in mentioning this 16% difference.
However, the difference in chances between the two ambassadors is larger if we see the game from a broader perspective. The rules of the game say that it stops only when there is a winner. From that point of view, as demonstrated above, Louana has a 66.67% chance of winning the game, while Colin has a 33.33% chance.
In this situation, the difference in the probability of winning between the two ambassadors is 66.67% - 33.33% = 33.34%!
I must admit that I was not expecting such a large difference between the two ambassadors. Seeing the ambassadors’ game from that perspective looks different to what we were told, or to what we thought before actually computing the probabilities. (Denis Brogniart, if you happen to read this, feel free to let me know whether this relatively large difference was intended or not.)
For those of you who do not watch the show: upon the return of the ambassadors to the island with all the adventurers, Colin was heavily criticized by the members of his tribe. His decision not to play the ambassadors' game (and the decision to eliminate one member of his tribe with the aim of saving himself) was seen as a betrayal. As a consequence, he was eliminated by the reunited tribe directly after that episode.
I am going to conclude this article with the following question:
If you were an ambassador in the 2022 Koh-Lanta season, and given that you now know the exact probabilities of winning for each tribe, what would you have done?
Thanks for reading.
As always, if you have any question related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion.
We will never know if he would have played the game had the chances of winning been equal for both ambassadors. Since I started watching this television show, I have never seen an ambassadors' negotiation lead to the ambassadors' game; they have all ended with the designation of an adventurer.
And to be honest, I wanted to compute these probabilities because, while I was watching the show with my girlfriend, she looked at me and asked, "What are the probabilities for each of them?" I am now able to give her a precise answer.
Following our recent post on "Safeguards and Backups for GitHub Organizations", nearly one month ago we went one step further and made two-factor authentication (2FA) required for all members and outside collaborators of our main organization, ropensci.
It was a timely decision, as GitHub has since announced it will require all users who contribute code on GitHub.com to enable one or more forms of two-factor authentication (2FA) by the end of 2023.
Here is how (and why) we went about it.
We used to only require two-factor authentication of organization owners (which is not an actual setting of GitHub, just a rule we set for ourselves).
However, requiring 2FA for the whole organization seemed like a logical step towards more security.
Hopefully it also inspires more 2FA adoption beyond the ropensci organization, as new adopters of the setting can tell their collaborators about it.
When one starts requiring 2FA for an organization, all members and outside collaborators who have not enabled it are removed from the organization and receive a notification from GitHub.
One aspect we pondered was whether it’d be potentially unfair to require 2FA. Many 2FA systems rely on the user having a mobile device, which could be a barrier for some. However, GitHub provides many different 2FA methods (not only those requiring a mobile device), so in the end we decided to go for it, but to make careful note of feedback from organization members and collaborators.
As recommended in GitHub docs, we communicated about the change in advance, sending emails to all organization members and outside collaborators without 2FA two weeks before the switch. This was meant to ensure that as few people as possible lost access to their repositories.
Email addresses were collected via the GitHub API; for those without a public email address on GitHub, we used a search engine as well as email addresses attached to recent commits. Taking the time to do so was also crucial to, again, remove as few people as possible from the organization.
We sent emails using the gmailr package. We manually went through automatic responses to decide on the appropriate action (e.g. scheduling a new email after someone was back from vacation, or looking for a newer email address).
Here’s the text we used:
```
Dear ropensci GitHub organization member,

To increase the security of the `ropensci` GitHub organization, we plan to
make two-factor authentication ("2FA") required for members on April 18th.
On this date, you will lose access to any ropensci GitHub repository you
previously had access to.

Docs: https://docs.github.com/en/authentication/securing-your-account-with-two-factor-authentication-2fa/configuring-two-factor-authentication

Obviously we encourage anyone to enable 2FA if you had not done so yet.

Please email me or info@ropensci.org if you have:
- any question regarding 2FA,
- enabled 2FA after April 18th and would like to be added back.

I will be glad to help!

Thanks and best wishes,

Maëlle, on behalf of the rOpenSci team.
```
The subject was [Action required] Enable GitHub Two-Factor Authentication to Preserve Access.
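As a rough sketch of the gmailr workflow we used (the addresses and credentials path below are made up, and gm_auth_configure() needs your own Google Cloud OAuth client):

```r
library(gmailr)

# Authenticate with a Google Cloud OAuth client (file path is hypothetical)
gm_auth_configure(path = "credentials.json")
gm_auth()

# Compose the message
msg <- gm_mime() |>
  gm_to("member@example.org") |>
  gm_from("maintainer@example.org") |>
  gm_subject("[Action required] Enable GitHub Two-Factor Authentication to Preserve Access") |>
  gm_text_body("Dear ropensci GitHub organization member, ...")

# Send it
gm_send_message(msg)
```

In practice we looped over the collected addresses and composed one message per recipient.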
We also posted a message about the change in our Slack workspace, as well as quite a few Slack DMs in particular to package maintainers.
Note that we actually had to send two emails as we first misread the documentation and thought members could become outside collaborators without 2FA. That’s not true! Thanks to organization members and outside collaborators for their patience.
To see, respectively, removed outside collaborators and removed organization members, as organization owners we were able to use the URLs:

- https://github.com/organizations/<org-name>/settings/audit-log?q=action%3Aorg.remove_outside_collaborator
- https://github.com/organizations/<org-name>/settings/audit-log?q=action%3Aorg.remove_member

Before confirming the 2FA requirement, GitHub actually shows you which members you are about to remove from the organization. We went ahead anyway, as we assumed the GitHub notification might reach some of them better than our previous email, and as the removal of people is reversible as soon as they have enabled 2FA.
After we flipped the switch on April 19th, a few people reached out to tell us they had enabled 2FA. It was straightforward to add them back to the organization, as GitHub lets you reinstate an organization member within three months of their removal, so they get added back to the teams and repos they used to belong to. We strove to reinstate people promptly.
If you were an organization member or outside collaborator who was removed from the organization, please reach out to us if you have enabled 2FA and would like to be reinstated, or if you have any question about 2FA.
Twofactor authentication is now required for our main GitHub organization. It was a change made for the best for most people, but which might have created some pain and frustration for a few people depending on when they got the notification. We thank everyone for their collaboration and understanding.
As a further GitHub security step for us and you, dear reader, let’s mention
And more generally, we recommend the article Ten quick tips for staying safe online by Danielle Smalls and Greg Wilson.
Oftentimes, before crucial matches or in general, we would like to know the performance of a batsman against a bowler or vice versa, but we may not have the data. We generally have data in which different batsmen have faced different sets of bowlers, with performance measures such as ballsFaced, totalRuns, fours, sixes, strike rate and timesOut. Similarly, different bowlers have performance figures (deliveries, runsConceded, economyRate and wicketTaken) against different sets of batsmen. We will never have the data for all batsmen against all bowlers. However, it would be good to estimate the performance of a batsman against a bowler even when we do not have the performance data. This can be done using collaborative filtering, which identifies and computes missing values based on the similarity between batsmen vs bowlers and bowlers vs batsmen.
This post shows an approach whereby we can estimate a batsman's performance against bowlers even though the batsman may not have faced those bowlers, based on his/her performance against other bowlers. It also estimates the performance of bowlers against batsmen using the same approach. This is based on the recommender algorithm, which is used to recommend products to customers based on their ratings of other products.
This idea came to me while generating the performance of batsmen vs bowlers and vice versa for the IPL teams during IPL 2022 with my Shiny app GooglyPlusPlus. I found that there were some batsmen for whom there was no data against certain bowlers, probably because they were playing in their team for the first time or because they were new. While pondering this problem, I realized that its formulation is similar to that of the famous Netflix movie recommendation problem, in which users' ratings for certain movies are known, and based on these ratings, the recommender engine can generate ratings for movies not yet seen.
This post estimates a player's (batsman's or bowler's) performance using a recommender engine. It is based on the R package recommenderlab:
“Michael Hahsler (2021). recommenderlab: Lab for Developing and Testing Recommender Algorithms. R package version 0.2-7. https://github.com/mhahsler/recommenderlab”
Note 1: The data for this analysis is taken from Cricsheet after being processed by my R package yorkr.
You can also read this post in RPubs at Player Performance Estimation using AI Collaborative Filtering
A PDF copy of this post is available at Player Performance Estimation using AI Collaborative Filtering.pdf
You can download this R Markdown file and the associated data and perform the analysis yourself using any other recommender engine from Github at playerPerformanceEstimation
In the table below we see a set of bowlers vs a set of batsmen and the number of times the bowlers got these batsmen out.
By knowing the performance of the bowlers against some of the batsmen, we can use collaborative filtering to determine the missing values. This is done using the recommender engine.
The Recommender Engine works as follows. Let us say that there are feature vectors x(1), x(2) and x(3) for the 3 bowlers, which identify the characteristics of these bowlers ("fast", "lateral drift through the air", "movement off the pitch"). Let each batsman be identified by a parameter vector θ(1), θ(2) and so on.
For example, consider the following table.
Then, by assuming an initial estimate for the parameter vectors θ and the feature vectors x, we can formulate this as an optimization problem which tries to minimize the error between the predicted rating (θ(j))T x(i) and the actual value y(i,j). This can work very well, as the algorithm can determine features which cannot be captured directly. For example, some particular bowler may have very impressive figures. This could be due to some aspect of the bowling which cannot be captured by the data: let's say the bowler uses the 'scrambled seam' when he is most effective, with a slightly different arc to the flight. Though the algorithm cannot identify the feature as we know it, the ML algorithm should pick up intricacies which cannot be captured in data.
Hence the algorithm can be quite effective.
Note: The recommenderlab performance is not very good and the Mean Square Error is quite high. Also, the ROC and AUC curves show that the algorithm does not in all cases do a clean job of separating the true positives (TPR) from the false positives (FPR).
Note: This is similar to the recommendation problem
The collaborative filtering objective can be considered as a joint minimization over both the parameters θ and the features x, and can be written as

J(x(1), ..., x(nm), θ(1), ..., θ(nu)) = 1/2 Σ(i,j):r(i,j)=1 ((θ(j))T x(i) − y(i,j))² + λ/2 Σi Σk (x_k(i))² + λ/2 Σj Σk (θ_k(j))²
The collaborative filtering algorithm can be summarized as follows: initialize the features x(1), ..., x(nm) and parameters θ(1), ..., θ(nu) to small random values; minimize J(x, θ) using gradient descent, updating the features and parameters simultaneously; and finally predict a missing rating for batsman j against bowler i as (θ(j))T x(i).
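These steps can be illustrated with a minimal base-R sketch. This is a toy example of my own, not the recommenderlab implementation used later in this post: a small "timesOut"-style matrix with missing entries is factorized into feature and parameter matrices by gradient descent on the regularized squared error. The matrix values, the number of latent features and the tuning constants are all made up for illustration.

```r
# Toy collaborative filtering by gradient descent (base R, 2 latent features)
set.seed(2022)
Y <- matrix(c(3, NA, 1,
              NA, 4, 2,
              2, 3, NA), nrow = 3, byrow = TRUE)   # batsmen x bowlers, NA = no data
n_feat <- 2
X     <- matrix(rnorm(nrow(Y) * n_feat, sd = 0.1), nrow(Y), n_feat)  # batsman features
Theta <- matrix(rnorm(ncol(Y) * n_feat, sd = 0.1), ncol(Y), n_feat)  # bowler parameters
lambda <- 0.01   # regularization
alpha  <- 0.05   # learning rate
R <- !is.na(Y)   # indicator r(i,j) = 1 where a rating exists
for (iter in 1:2000) {
  E <- X %*% t(Theta) - Y
  E[!R] <- 0                                  # only observed cells contribute
  X_grad     <- E %*% Theta + lambda * X
  Theta_grad <- t(E) %*% X  + lambda * Theta
  X     <- X     - alpha * X_grad             # simultaneous updates
  Theta <- Theta - alpha * Theta_grad
}
pred <- X %*% t(Theta)   # estimates for both known and missing cells
round(pred, 2)
```

The missing cells of `pred` are the collaborative-filtering estimates; the observed cells should be reproduced closely.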
The above derivation for the recommender problem is taken from Machine Learning by Prof Andrew Ng at Coursera from the lecture Collaborative filtering
There are 2 main types of Collaborative Filtering (CF) approaches: user-based CF (UBCF), which makes estimates based on the similarity between users, and item-based CF (IBCF), which makes estimates based on the similarity between items.
A small note on interpreting ROC & Precision-Recall curves in the post below
ROC Curve: The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR). Ideally the TPR should increase faster than the FPR and the AUC (area under the curve) should be close to 1
Precision-Recall: The precision-recall curve shows the tradeoff between precision and recall at different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate.
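As a concrete reminder of how these quantities are computed (a generic illustration, unrelated to the cricket data), the rates can be derived directly from confusion-matrix counts:

```r
# TPR/FPR (the ROC axes) and precision/recall from confusion-matrix counts
rates <- function(TP, FP, FN, TN) {
  c(TPR       = TP / (TP + FN),   # recall / sensitivity, y-axis of the ROC curve
    FPR       = FP / (FP + TN),   # x-axis of the ROC curve
    precision = TP / (TP + FP),
    recall    = TP / (TP + FN))
}
round(rates(TP = 40, FP = 10, FN = 20, TN = 30), 3)
# TPR 0.667, FPR 0.25, precision 0.8, recall 0.667
```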
library(reshape2)
library(dplyr)
library(ggplot2)
library(recommenderlab)
library(tidyr)
load("recom_data/batsmenVsBowler20_22.rdata")
Helper functions for the RMarkdown notebook are created
# This function returns the error for the chosen algorithm and also predicts the estimates
# for the given data
eval <- function(data, train1, k1, given1, goodRating1, recomType1="UBCF"){
  set.seed(2022)
  e <- evaluationScheme(data, method = "split", train = train1, k = k1, given = given1, goodRating = goodRating1)
  r1 <- Recommender(getData(e, "train"), recomType1)
  print(r1)
  p1 <- predict(r1, getData(e, "known"), type="ratings")
  print(p1)
  error = calcPredictionAccuracy(p1, getData(e, "unknown"))
  print(error)
  p2 <- predict(r1, data, type="ratingMatrix")
  p2
}

# This function will evaluate the different recommender algorithms and plot the AUC and ROC curves
evalRecomMethods <- function(data, k1, given1, goodRating1){
  set.seed(2022)
  e <- evaluationScheme(data, method = "cross", k = k1, given = given1, goodRating = goodRating1)
  models_to_evaluate <- list(
    `IBCF Cosine`  = list(name = "IBCF", param = list(method = "cosine")),
    `IBCF Pearson` = list(name = "IBCF", param = list(method = "pearson")),
    `UBCF Cosine`  = list(name = "UBCF", param = list(method = "cosine")),
    `UBCF Pearson` = list(name = "UBCF", param = list(method = "pearson")),
    `Random`       = list(name = "RANDOM", param = NULL)
  )
  n_recommendations <- c(1, 5, seq(10, 100, 10))
  list_results <- evaluate(x = e, method = models_to_evaluate, n = n_recommendations)
  plot(list_results, annotate=c(1,3), legend="bottomright")
  plot(list_results, "prec/rec", annotate=3, legend="topleft")
}
The section below regenerates the performance for batsmen based on incomplete data for the different fields in the data frame, namely balls faced, fours, sixes, strike rate and times out. The recommenderlab package allows one to test several different algorithms all at once, namely IBCF (cosine and Pearson similarity), UBCF (cosine and Pearson similarity) and a random baseline.
head(df)
##   batsman1         bowler1         ballsFaced totalRuns fours sixes SR  timesOut
## 1 A Badoni        A Mishra        0          0         0     0     NaN 0
## 2 A Badoni        A Nortje        0          0         0     0     NaN 0
## 3 A Badoni        A Zampa         0          0         0     0     NaN 0
## 4 A Badoni        Abdul Samad     0          0         0     0     NaN 0
## 5 A Badoni        Abhishek Sharma 0          0         0     0     NaN 0
## 6 A Badoni        AD Russell      0          0         0     0     NaN 0
For this analysis the data from Cricsheet has been processed using my R package yorkr to obtain the following 2 data sets – batsmenVsBowler – This dataset contains the performance of the batsmen against the bowlers and captures a) ballsFaced b) totalRuns c) fours d) sixes e) SR f) timesOut – bowlerVsBatsmen – This data set contains the performance of the bowlers against the different batsmen and includes a) deliveries b) runsConceded c) economyRate d) wicketsTaken
Obviously many rows/columns will be empty
This is a large data set and hence I have filtered for the period > Jan 2020 and < Dec 2022 which gives 2 datasets a) batsmanVsBowler20_22.rdata b) bowlerVsBatsman20_22.rdata
I also have 2 other datasets listing all the batsmen and bowlers appearing in these 2 datasets, in the files c) allbatsmen20_22.rds d) allbowlers20_22.rds
You can download the data and this RMarkdown notebook from Github at PlayerPerformanceEstimation
Feel free to download and analyze the data and use any recommendation engine you choose
Initially an exploratory analysis is done on the data
df3 <- select(df, batsman1,bowler1,timesOut)
df6 <- xtabs(timesOut ~ ., df3)
df7 <- as.data.frame.matrix(df6)
df8 <- data.matrix(df7)
df8[df8 == 0] <- NA
print(df8[1:10,1:10])
##                 A Mishra A Nortje A Zampa Abdul Samad Abhishek Sharma
## A Badoni        NA NA NA NA NA
## A Manohar       NA NA NA NA NA
## A Nortje        NA NA NA NA NA
## AB de Villiers  NA  4  3 NA NA
## Abdul Samad     NA NA NA NA NA
## Abhishek Sharma NA NA NA NA NA
## AD Russell       1 NA NA NA NA
## AF Milne        NA NA NA NA NA
## AJ Finch        NA NA NA NA  3
## AJ Tye          NA NA NA NA NA
##                 AD Russell AF Milne AJ Tye AK Markram Akash Deep
## A Badoni        NA NA NA NA NA
## A Manohar       NA NA NA NA NA
## A Nortje        NA NA NA NA NA
## AB de Villiers   3 NA  3 NA NA
## Abdul Samad     NA NA NA NA NA
## Abhishek Sharma NA NA NA NA NA
## AD Russell      NA NA  6 NA NA
## AF Milne        NA NA NA NA NA
## AJ Finch        NA NA NA NA NA
## AJ Tye          NA NA NA NA NA
The dots below represent data for which there is no performance data. These cells need to be estimated by the algorithm
set.seed(2022)
r <- as(df8,"realRatingMatrix")
getRatingMatrix(r)[1:15,1:15]
## 15 x 15 sparse Matrix of class "dgCMatrix"
## [[ suppressing 15 column names 'A Mishra', 'A Nortje', 'A Zampa' ... ]]
##
## A Badoni         . . . . . . . . . . . . . . .
## A Manohar        . . . . . . . . . . . . . . .
## A Nortje         . . . . . . . . . . . . . . .
## AB de Villiers   . 4 3 . . 3 . 3 . . . 4 3 . .
## Abdul Samad      . . . . . . . . . . . . . . .
## Abhishek Sharma  . . . . . . . . . . . 1 . . .
## AD Russell       1 . . . . . . 6 . . . 3 3 3 .
## AF Milne         . . . . . . . . . . . . . . .
## AJ Finch         . . . . 3 . . . . . . 1 . . .
## AJ Tye           . . . . . . . . . . . 1 . . .
## AK Markram       . . . 3 . . . . . . . . . . .
## AM Rahane        9 . . . . 3 . 3 . . . 3 3 . .
## Anmolpreet Singh . . . . . . . . . . . . . . .
## Anuj Rawat       . . . . . . . . . . . . . . .
## AR Patel         . . . . . . . 1 . . . . . . .
r0 <- r[(rowCounts(r) > 10),]
getRatingMatrix(r0)[1:15,1:15]
## 15 x 15 sparse Matrix of class "dgCMatrix"
## [[ suppressing 15 column names 'A Mishra', 'A Nortje', 'A Zampa' ... ]]
##
## AB de Villiers  . 4 3 . . 3 . 3 . . . 4 3 . .
## Abdul Samad     . . . . . . . . . . . . . . .
## Abhishek Sharma . . . . . . . . . . . 1 . . .
## AD Russell      1 . . . . . . 6 . . . 3 3 3 .
## AJ Finch        . . . . 3 . . . . . . 1 . . .
## AM Rahane       9 . . . . 3 . 3 . . . 3 3 . .
## AR Patel        . . . . . . . 1 . . . . . . .
## AT Rayudu       2 . . . . . 1 . . . . 3 . . .
## B Kumar         3 . 3 . . . . . . . . . . 3 .
## BA Stokes       . . . . . . 3 4 . . . 3 . . .
## CA Lynn         . . . . . . . 9 . . . 3 . . .
## CH Gayle        . . . . . 6 . 3 . . . 6 . . .
## CH Morris       . 3 . . . . . . . . . 3 . . .
## D Padikkal      . 4 . . . 3 . . . . . . 3 . .
## DA Miller       . . . . . 3 . . . . . 3 . . .
# Get the summary of the data
summary(getRatings(r0))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.463 4.000 21.000
# Normalize the data
r0_m <- normalize(r0)
getRatingMatrix(r0_m)[1:15,1:15]
## 15 x 15 sparse Matrix of class "dgCMatrix"
## [[ suppressing 15 column names 'A Mishra', 'A Nortje', 'A Zampa' ... ]]
##
## AB de Villiers  . 0.7857143 1.7857143 . . 1.7857143
## Abdul Samad     . . . . . .
## Abhishek Sharma . . . . . .
## AD Russell      2.6562500 . . . . .
## AJ Finch        . . . . 0.03125 .
## AM Rahane       4.6041667 . . . . 1.3958333
## AR Patel        . . . . . .
## AT Rayudu       2.1363636 . . . . .
## B Kumar         0.3636364 . 0.3636364 . . .
## BA Stokes       . . . . . .
## CA Lynn         . . . . . .
## CH Gayle        . . . . . 1.5476190
## CH Morris       . 0.3500000 . . . .
## D Padikkal      . 0.6250000 . . . 0.3750000
## DA Miller       . . . . . 0.7037037
##
## AB de Villiers  . 1.7857143 . . . 0.7857143 1.785714 . .
## Abdul Samad     . . . . . . . . .
## Abhishek Sharma . . . . . 1.6000000 . . .
## AD Russell      . 2.3437500 . . . 0.6562500 0.656250 0.6562500 .
## AJ Finch        . . . . . 2.0312500 . . .
## AM Rahane       . 1.3958333 . . . 1.3958333 1.395833 . .
## AR Patel        . 2.3333333 . . . . . . .
## AT Rayudu       3.1363636 . . . . 1.1363636 . . .
## B Kumar         . . . . . . . 0.3636364 .
## BA Stokes       0.6086957 0.3913043 . . . 0.6086957 . . .
## CA Lynn         . 5.3200000 . . . 0.6800000 . . .
## CH Gayle        . 1.4523810 . . . 1.5476190 . . .
## CH Morris       . . . . . 0.3500000 . . .
## D Padikkal      . . . . . . 0.375000 . .
## DA Miller       . . . . . 0.7037037 . . .
The histograms show the bias in the data is removed after normalization
r0=r[(m=rowCounts(r) > 10),]
getRatingMatrix(r0)[1:15,1:10]
## 15 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'A Mishra', 'A Nortje', 'A Zampa' ... ]]
##
## AB de Villiers  . 4 3 . . 3 . 3 . .
## Abdul Samad     . . . . . . . . . .
## Abhishek Sharma . . . . . . . . . .
## AD Russell      1 . . . . . . 6 . .
## AJ Finch        . . . . 3 . . . . .
## AM Rahane       9 . . . . 3 . 3 . .
## AR Patel        . . . . . . . 1 . .
## AT Rayudu       2 . . . . . 1 . . .
## B Kumar         3 . 3 . . . . . . .
## BA Stokes       . . . . . . 3 4 . .
## CA Lynn         . . . . . . . 9 . .
## CH Gayle        . . . . . 6 . 3 . .
## CH Morris       . 3 . . . . . . . .
## D Padikkal      . 4 . . . 3 . . . .
## DA Miller       . . . . . 3 . . . .
# Plot ratings
image(r0, main = "Raw Ratings")
# Plot normalized ratings
r0_m <- normalize(r0)
getRatingMatrix(r0_m)[1:15,1:15]
## 15 x 15 sparse Matrix of class "dgCMatrix"
## [[ suppressing 15 column names 'A Mishra', 'A Nortje', 'A Zampa' ... ]]
##
## AB de Villiers  . 0.7857143 1.7857143 . . 1.7857143
## Abdul Samad     . . . . . .
## Abhishek Sharma . . . . . .
## AD Russell      2.6562500 . . . . .
## AJ Finch        . . . . 0.03125 .
## AM Rahane       4.6041667 . . . . 1.3958333
## AR Patel        . . . . . .
## AT Rayudu       2.1363636 . . . . .
## B Kumar         0.3636364 . 0.3636364 . . .
## BA Stokes       . . . . . .
## CA Lynn         . . . . . .
## CH Gayle        . . . . . 1.5476190
## CH Morris       . 0.3500000 . . . .
## D Padikkal      . 0.6250000 . . . 0.3750000
## DA Miller       . . . . . 0.7037037
##
## AB de Villiers  . 1.7857143 . . . 0.7857143 1.785714 . .
## Abdul Samad     . . . . . . . . .
## Abhishek Sharma . . . . . 1.6000000 . . .
## AD Russell      . 2.3437500 . . . 0.6562500 0.656250 0.6562500 .
## AJ Finch        . . . . . 2.0312500 . . .
## AM Rahane       . 1.3958333 . . . 1.3958333 1.395833 . .
## AR Patel        . 2.3333333 . . . . . . .
## AT Rayudu       3.1363636 . . . . 1.1363636 . . .
## B Kumar         . . . . . . . 0.3636364 .
## BA Stokes       0.6086957 0.3913043 . . . 0.6086957 . . .
## CA Lynn         . 5.3200000 . . . 0.6800000 . . .
## CH Gayle        . 1.4523810 . . . 1.5476190 . . .
## CH Morris       . . . . . 0.3500000 . . .
## D Padikkal      . . . . . . 0.375000 . .
## DA Miller       . . . . . 0.7037037 . . .
image(r0_m, main = "Normalized Ratings")
set.seed(1234)
hist(getRatings(r0), breaks=25)
hist(getRatings(r0_m), breaks=25)
The data frame of the batsmen vs bowlers for the period 2020-2022 is read as a dataframe. To remove rows with a very low number of ratings (timesOut, SR, fours, sixes etc.), the rows are filtered so that there are at least 10 values in the row. For the player estimation the dataframe is converted into wide format as a matrix (m x n) of batsman x bowler for each of the columns of the dataframe, i.e. timesOut, SR, fours or sixes. Each of these matrices can be considered as a rating matrix for estimation.
A similar approach is taken for estimating bowler performance. Here a wide-form matrix (m x n) of bowler x batsman is created for each of the columns deliveries, runsConceded, ER and wicketsTaken.
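The reshaping described above can be sketched on toy data (the player names and values below are hypothetical; the real analysis uses the full data frames loaded earlier):

```r
# Toy long-format performance data: batsman, bowler, timesOut
df_long <- data.frame(
  batsman  = c("P1", "P1", "P2"),
  bowler   = c("B1", "B2", "B1"),
  timesOut = c(2, 1, 3)
)
# Long to wide: an m x n batsman-x-bowler contingency table
wide <- xtabs(timesOut ~ batsman + bowler, df_long)
m <- data.matrix(as.data.frame.matrix(wide))
m[m == 0] <- NA   # zeros here mean "no data", so mark them as missing
m
```

The resulting matrix, with its NA cells, is exactly the kind of "rating matrix" the recommender engine fills in.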
The code below estimates the number of times a batsman would lose his/her wicket to a bowler. As discussed in the algorithm above, the recommendation engine makes an initial estimate of the feature vectors for the bowlers and of the parameter vectors for the batsmen. Then, using gradient descent, the recommender engine determines feature and parameter values such that the overall Mean Squared Error is minimized.
From the plot for the different algorithms it can be seen that UBCF performs the best. However, the ROC & AUC curves are not optimal, with the AUC only marginally above 0.5.
df3 <- select(df, batsman1,bowler1,timesOut)
df6 <- xtabs(timesOut ~ ., df3)
df7 <- as.data.frame.matrix(df6)
df8 <- data.matrix(df7)
df8[df8 == 0] <- NA
r <- as(df8,"realRatingMatrix")
# Filter only rows where the row count is > 10
r0 <- r[(rowCounts(r) > 10),]
getRatingMatrix(r0)[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'A Mishra', 'A Nortje', 'A Zampa' ... ]]
##
## AB de Villiers  . 4 3 . . 3 . 3 . .
## Abdul Samad     . . . . . . . . . .
## Abhishek Sharma . . . . . . . . . .
## AD Russell      1 . . . . . . 6 . .
## AJ Finch        . . . . 3 . . . . .
## AM Rahane       9 . . . . 3 . 3 . .
## AR Patel        . . . . . . . 1 . .
## AT Rayudu       2 . . . . . 1 . . .
## B Kumar         3 . 3 . . . . . . .
## BA Stokes       . . . . . . 3 4 . .
summary(getRatings(r0))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.463 4.000 21.000
# Evaluate the different recommender methods
evalRecomMethods(r0[1:dim(r0)[1]],k1=5,given=7,goodRating1=median(getRatings(r0)))
# Evaluate the error
a=eval(r0[1:dim(r0)[1]],0.8,k1=5,given1=7,goodRating1=median(getRatings(r0)),"UBCF")
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 70 users.
## 18 x 145 rating matrix of class 'realRatingMatrix' with 1755 ratings.
##     RMSE      MSE      MAE
## 2.069027 4.280872 1.496388
b=round(as(a,"matrix")[1:10,1:10])
c <- as(b,"realRatingMatrix")
m=as(c,"data.frame")
names(m) =c("batsman","bowler","TimesOut")
This section deals with the strike rate of batsmen versus bowlers and estimates the values where the data is incomplete, using the UBCF method.
Even here, none of the algorithms performs particularly well. I did try out a few variations but could not lower the error (suggestions welcome!!)
df3 <- select(df, batsman1,bowler1,SR)
df6 <- xtabs(SR ~ ., df3)
df7 <- as.data.frame.matrix(df6)
df8 <- data.matrix(df7)
df8[df8 == 0] <- NA
r <- as(df8,"realRatingMatrix")
r0 <- r[(rowCounts(r) > 10),]
getRatingMatrix(r0)[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'A Mishra', 'A Nortje', 'A Zampa' ... ]]
##
## AB de Villiers   96.8254 171.4286  33.33333  . 66.66667 223.07692 .
## Abdul Samad            . 228.0000         .  .        . 100.00000 .
## Abhishek Sharma 150.0000        .         .  .        .  66.66667 .
## AD Russell      111.4286        .         .  .        .         . .
## AJ Finch        250.0000 116.6667         .  . 50.00000  85.71429 112.5000
## AJ Tye                 .        .         .  .        .         . 100.0000
## AK Markram             .        .         . 50        .         . .
## AM Rahane       121.1111        .         .  .        . 113.82979 117.9487
## AR Patel        183.3333        . 200.00000  .        . 433.33333 .
## AT Rayudu       126.5432 200.0000 122.22222  .        . 105.55556 .
##
## AB de Villiers  109.52381 .   .
## Abdul Samad             . .   .
## Abhishek Sharma         . .   .
## AD Russell      195.45455 .   .
## AJ Finch                . .   .
## AJ Tye                  . .   .
## AK Markram              . .   .
## AM Rahane        33.33333 . 200
## AR Patel        171.42857 .   .
## AT Rayudu       204.76190 .   .
summary(getRatings(r0))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.882 85.714 116.667 128.529 160.606 600.000
evalRecomMethods(r0[1:dim(r0)[1]],k1=5,given=7,goodRating1=median(getRatings(r0)))
a=eval(r0[1:dim(r0)[1]],0.8, k1=5,given1=7,goodRating1=median(getRatings(r0)),"UBCF")
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 105 users.
## 27 x 145 rating matrix of class 'realRatingMatrix' with 3220 ratings.
##     RMSE        MSE      MAE
## 77.71979 6040.36508 58.58484
b=round(as(a,"matrix")[1:10,1:10])
c <- as(b,"realRatingMatrix")
n=as(c,"data.frame")
names(n) =c("batsman","bowler","SR")
The snippet of code below estimates the sixes hit by the batsmen against bowlers. The ROC and AUC curve for UBCF looks a lot better here, with the AUC significantly greater than 0.5.
df3 <- select(df, batsman1,bowler1,sixes)
df6 <- xtabs(sixes ~ ., df3)
df7 <- as.data.frame.matrix(df6)
df8 <- data.matrix(df7)
df8[df8 == 0] <- NA
r <- as(df8,"realRatingMatrix")
r0 <- r[(rowCounts(r) > 10),]
getRatingMatrix(r0)[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'A Mishra', 'A Nortje', 'A Zampa' ... ]]
##
## AB de Villiers  3 3 . . . 18 .  3 . .
## AD Russell      3 . . . .  . . 12 . .
## AJ Finch        2 . . . .  . .  . . .
## AM Rahane       7 . . . .  3 1  . . .
## AR Patel        4 . 3 . .  6 .  1 . .
## AT Rayudu       5 2 . . .  . .  1 . .
## BA Stokes       . . . . .  . .  . . .
## CA Lynn         . . . . .  . .  9 . .
## CH Gayle       17 . . . . 17 .  . . .
## CH Morris       . . 3 . .  . .  . . .
summary(getRatings(r0))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 3.00 4.68 6.00 33.00
evalRecomMethods(r0[1:dim(r0)[1]],k1=5,given=7,goodRating1=median(getRatings(r0)))
## Timing stopped at: 0.003 0 0.002
a=eval(r0[1:dim(r0)[1]],0.8, k1=5,given1=7,goodRating1=median(getRatings(r0)),"UBCF")
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 52 users.
## 14 x 145 rating matrix of class 'realRatingMatrix' with 1634 ratings.
##     RMSE       MSE      MAE
## 3.529922 12.460350 2.532122
b=round(as(a,"matrix")[1:10,1:10])
c <- as(b,"realRatingMatrix")
o=as(c,"data.frame")
names(o) =c("batsman","bowler","Sixes")
The code below estimates 4s for the batsmen
df3 <- select(df, batsman1,bowler1,fours)
df6 <- xtabs(fours ~ ., df3)
df7 <- as.data.frame.matrix(df6)
df8 <- data.matrix(df7)
df8[df8 == 0] <- NA
r <- as(df8,"realRatingMatrix")
r0 <- r[(rowCounts(r) > 10),]
getRatingMatrix(r0)[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'A Mishra', 'A Nortje', 'A Zampa' ... ]]
##
## AB de Villiers   . 1 . . . 24 . 3 . .
## Abhishek Sharma  . . . . .  . . . . .
## AD Russell       1 . . . .  . . 9 . .
## AJ Finch         . 1 . . .  3 2 . . .
## AK Markram       . . . . .  . . . . .
## AM Rahane       11 . . . .  8 7 . . 3
## AR Patel         . . . . .  . . 3 . .
## AT Rayudu       11 2 3 . .  6 . 6 . .
## BA Stokes        1 . . . .  . . . . .
## CA Lynn          . . . . .  . . 6 . .
summary(getRatings(r0))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 6.339 9.000 55.000
evalRecomMethods(r0[1:dim(r0)[1]],k1=5,given=7,goodRating1=median(getRatings(r0)))
## Timing stopped at: 0.008 0 0.008
## Warning in .local(x, method, ...):
## Recommender 'UBCF Pearson' has failed and has been removed from the results!
a=eval(r0[1:dim(r0)[1]],0.8, k1=5,given1=7,goodRating1=median(getRatings(r0)),"UBCF")
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 67 users.
## 17 x 145 rating matrix of class 'realRatingMatrix' with 2083 ratings.
##     RMSE       MSE      MAE
## 5.486661 30.103447 4.060990
b=round(as(a,"matrix")[1:10,1:10])
c <- as(b,"realRatingMatrix")
p=as(c,"data.frame")
names(p) =c("batsman","bowler","Fours")
The code below estimates the total runs that would have been scored by the batsmen against different bowlers.
df3 <- select(df, batsman1,bowler1,totalRuns)
df6 <- xtabs(totalRuns ~ ., df3)
df7 <- as.data.frame.matrix(df6)
df8 <- data.matrix(df7)
df8[df8 == 0] <- NA
r <- as(df8,"realRatingMatrix")
r0 <- r[(rowCounts(r) > 10),]
getRatingMatrix(r)[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'A Mishra', 'A Nortje', 'A Zampa' ... ]]
##
## A Badoni         .  . . . .   . .   . . .
## A Manohar        .  . . . .   . .   . . .
## A Nortje         .  . . . .   . .   . . .
## AB de Villiers  61 36 3 . 6 261 .  69 . .
## Abdul Samad      . 57 . . .  12 .   . . .
## Abhishek Sharma  3  . . . .   6 .   . . .
## AD Russell      39  . . . .   . . 129 . .
## AF Milne         .  . . . .   . .   . . .
## AJ Finch        15  7 . . 3  18 9   . . .
## AJ Tye           .  . . . .   . 4   . . .
summary(getRatings(r0))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 9.00 24.00 41.36 54.00 452.00
evalRecomMethods(r0[1:dim(r0)[1]],k1=5,given1=7,goodRating1=median(getRatings(r0)))
a=eval(r0[1:dim(r0)[1]],0.8, k1=5,given1=7,goodRating1=median(getRatings(r0)),"UBCF")
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 105 users.
## 27 x 145 rating matrix of class 'realRatingMatrix' with 3256 ratings.
##     RMSE        MSE      MAE
## 41.50985 1723.06788 29.52958
b=round(as(a,"matrix")[1:10,1:10])
c <- as(b,"realRatingMatrix")
q=as(c,"data.frame")
names(q) =c("batsman","bowler","TotalRuns")
The snippet estimates the balls faced by batsmen versus bowlers
df3 <- select(df, batsman1,bowler1,ballsFaced)
df6 <- xtabs(ballsFaced ~ ., df3)
df7 <- as.data.frame.matrix(df6)
df8 <- data.matrix(df7)
df8[df8 == 0] <- NA
r <- as(df8,"realRatingMatrix")
r0 <- r[(rowCounts(r) > 10),]
getRatingMatrix(r)[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'A Mishra', 'A Nortje', 'A Zampa' ... ]]
##
## A Badoni         .  . . . .   . .  . . .
## A Manohar        .  . . . .   . .  . . .
## A Nortje         .  . . . .   . .  . . .
## AB de Villiers  63 21 9 . 9 117 . 63 . .
## Abdul Samad      . 25 . . .  12 .  . . .
## Abhishek Sharma  2  . . . .   9 .  . . .
## AD Russell      35  . . . .   . . 66 . .
## AF Milne         .  . . . .   . .  . . .
## AJ Finch         6  6 . . 6  21 8  . . .
## AJ Tye           .  . . . .   9 4  . . .
summary(getRatings(r0))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 9.00 18.00 30.21 39.00 384.00
evalRecomMethods(r0[1:dim(r0)[1]],k1=5,given=7,goodRating1=median(getRatings(r0)))
a=eval(r0[1:dim(r0)[1]],0.8, k1=5,given1=7,goodRating1=median(getRatings(r0)),"UBCF")
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 112 users.
## 28 x 145 rating matrix of class 'realRatingMatrix' with 3378 ratings.
##     RMSE        MSE      MAE
## 33.91251 1150.05835 23.39439
b=round(as(a,"matrix")[1:10,1:10])
c <- as(b,"realRatingMatrix")
r=as(c,"data.frame")
names(r) =c("batsman","bowler","BallsFaced")
This code generates the estimated dataframe with known and ‘predicted’ values
a1=merge(m,n,by=c("batsman","bowler"))
a2=merge(a1,o,by=c("batsman","bowler"))
a3=merge(a2,p,by=c("batsman","bowler"))
a4=merge(a3,q,by=c("batsman","bowler"))
a5=merge(a4,r,by=c("batsman","bowler"))
a6= select(a5, batsman,bowler,BallsFaced,TotalRuns,Fours, Sixes, SR,TimesOut)
head(a6)
##          batsman          bowler BallsFaced TotalRuns Fours Sixes  SR TimesOut
## 1 AB de Villiers        A Mishra         94       124     7     5 144        5
## 2 AB de Villiers        A Nortje         26        42     4     3 148        3
## 3 AB de Villiers         A Zampa         28        42     5     7 106        4
## 4 AB de Villiers Abhishek Sharma         22        28     0    10 136        5
## 5 AB de Villiers      AD Russell         70       135    14    12 207        4
## 6 AB de Villiers        AF Milne         31        45     6     6 130        3
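Once the known and 'predicted' values are in one table, rough derived measures can be computed from it. The sketch below uses a single made-up row with the same column names as a6 (the values mirror the first row shown above):

```r
# Hypothetical one-row estimate with the same columns as a6
est <- data.frame(batsman = "AB de Villiers", bowler = "A Mishra",
                  BallsFaced = 94, TotalRuns = 124, Fours = 7,
                  Sixes = 5, SR = 144, TimesOut = 5)
# Average runs per dismissal from the estimated values
est$AvgPerDismissal <- est$TotalRuns / est$TimesOut
# Recompute the strike rate from estimated runs and balls as a consistency check
est$SRCheck <- 100 * est$TotalRuns / est$BallsFaced
round(c(est$AvgPerDismissal, est$SRCheck), 1)   # 24.8 and 131.9
```

Comparing SRCheck against the separately estimated SR column gives a quick sanity check on how internally consistent the recommender's estimates are.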
Just like the batsman performance estimation we can consider the bowler’s performances also for estimation. Consider the following table
As in the batsman analysis, for every batsman a set of features like ("strong back-foot player", "360-degree player", "power hitter") can be estimated with a set of initial values. Also, every bowler will have an associated parameter vector θ. Different bowlers will have performance data for different sets of batsmen. Based on the initial estimates of the features and the parameters, gradient descent can be used to minimize the error against the actual values (e.g. wicketsTaken, used as the ratings).
load("recom_data/bowlerVsBatsman20_22.rdata")
Inspecting the bowler dataframe
head(df2)
##    bowler1        batsman1 balls runsConceded       ER wicketTaken
## 1 A Mishra        A Badoni     0            0 0.000000           0
## 2 A Mishra       A Manohar     0            0 0.000000           0
## 3 A Mishra        A Nortje     0            0 0.000000           0
## 4 A Mishra  AB de Villiers    63           61 5.809524           0
## 5 A Mishra     Abdul Samad     0            0 0.000000           0
## 6 A Mishra Abhishek Sharma     2            3 9.000000           0
names(df2)
## [1] "bowler1" "batsman1" "balls" "runsConceded" "ER"
## [6] "wicketTaken"
The section below estimates the balls bowled by each bowler to each batsman. We can see that UBCF Pearson and UBCF Cosine both perform well.
df3 <- select(df2, bowler1,batsman1,balls)
df6 <- xtabs(balls ~ ., df3)
df7 <- as.data.frame.matrix(df6)
df8 <- data.matrix(df7)
df8[df8 == 0] <- NA
r <- as(df8,"realRatingMatrix")
r0 <- r[(rowCounts(r) > 10),]
getRatingMatrix(r0)[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'A Badoni', 'A Manohar', 'A Nortje' ... ]]
##
## A Mishra        . . .  63  .  2 35 .  6 .
## A Nortje        . . .  21 25  .  . .  6 .
## A Zampa         . . .   9  .  .  . .  . .
## Abhishek Sharma . . .   9  .  .  . .  6 .
## AD Russell      . . . 117 12  9  . . 21 9
## AF Milne        . . .   .  .  .  . .  8 4
## AJ Tye          . . .  63  .  . 66 .  . .
## Akash Deep      . . .   .  .  .  . .  . .
## AR Patel        . . . 188  5  1 84 . 29 5
## Arshdeep Singh  . . .   6  6 24 18 . 12 .
summary(getRatings(r0))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 9.00 18.00 29.61 36.00 384.00
evalRecomMethods(r0[1:dim(r0)[1]],k1=5,given=7,goodRating1=median(getRatings(r0)))
a=eval(r0[1:dim(r0)[1]],0.8,k1=5,given1=7,goodRating1=median(getRatings(r0)),"UBCF")
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 96 users.
## 24 x 195 rating matrix of class 'realRatingMatrix' with 3954 ratings.
##     RMSE       MSE      MAE
## 30.72284 943.89294 19.89204
b=round(as(a,"matrix")[1:10,1:10])
c <- as(b,"realRatingMatrix")
s=as(c,"data.frame")
names(s) =c("bowler","batsman","BallsBowled")
This section estimates the runs conceded by the bowler. The UBCF Cosine algorithm performs the best, with the TPR increasing faster than the FPR.
df3 <- select(df2, bowler1,batsman1,runsConceded)
df6 <- xtabs(runsConceded ~ ., df3)
df7 <- as.data.frame.matrix(df6)
df8 <- data.matrix(df7)
df8[df8 == 0] <- NA
r <- as(df8,"realRatingMatrix")
r0 <- r[(rowCounts(r) > 10),]
getRatingMatrix(r0)[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'A Badoni', 'A Manohar', 'A Nortje' ... ]]
##
## A Mishra        . . .  61  .  3  41 . 15  .
## A Nortje        . . .  36 57  .   . .  8  .
## A Zampa         . . .   3  .  .   . .  .  .
## Abhishek Sharma . . .   6  .  .   . .  3  .
## AD Russell      . . . 276 12  6   . . 21  .
## AF Milne        . . .   .  .  .   . . 10  4
## AJ Tye          . . .  69  .  . 138 .  .  .
## Akash Deep      . . .   .  .  .   . .  .  .
## AR Patel        . . . 205  5  . 165 . 33 13
## Arshdeep Singh  . . .  18  3 51  51 .  6  .
summary(getRatings(r0))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 9.00 24.00 41.34 54.00 458.00
evalRecomMethods(r0[1:dim(r0)[1]],k1=5,given=7,goodRating1=median(getRatings(r0)))
## Timing stopped at: 0.004 0 0.004
## Warning in .local(x, method, ...):
## Recommender 'UBCF Pearson' has failed and has been removed from the results!
a=eval(r0[1:dim(r0)[1]],0.8,k1=5,given1=7,goodRating1=median(getRatings(r0)),"UBCF")
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 95 users.
## 24 x 195 rating matrix of class 'realRatingMatrix' with 3820 ratings.
##     RMSE        MSE      MAE
## 43.16674 1863.36749 30.32709
b=round(as(a,"matrix")[1:10,1:10])
c <- as(b,"realRatingMatrix")
t=as(c,"data.frame")
names(t) =c("bowler","batsman","RunsConceded")
This section computes the economy rate of the bowler. The performance is not all that good
df3 <- select(df2, bowler1,batsman1,ER)
df6 <- xtabs(ER ~ ., df3)
df7 <- as.data.frame.matrix(df6)
df8 <- data.matrix(df7)
df8[df8 == 0] <- NA
r <- as(df8,"realRatingMatrix")
r0 <- r[(rowCounts(r) > 10),]
getRatingMatrix(r0)[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'A Badoni', 'A Manohar', 'A Nortje' ... ]]
##
## A Mishra        . . .  5.809524     .  9.00  7.028571 . 15.000000    .
## A Nortje        . . . 10.285714 13.68     .         . .  8.000000    .
## A Zampa         . . .  2.000000     .     .         . .         .    .
## Abhishek Sharma . . .  4.000000     .     .         . .  3.000000    .
## AD Russell      . . . 14.153846  6.00  4.00         . .  6.000000    .
## AF Milne        . . .         .     .     .         . .  7.500000  6.0
## AJ Tye          . . .  6.571429     .     . 12.545455 .         .    .
## Akash Deep      . . .         .     .     .         . .         .    .
## AR Patel        . . .  6.542553  6.00     . 11.785714 .  6.827586 15.6
## Arshdeep Singh  . . . 18.000000  3.00 12.75 17.000000 .  3.000000    .
summary(getRatings(r0))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3529 5.2500 7.1126 7.8139 9.8000 36.0000
evalRecomMethods(r0[1:dim(r0)[1]],k1=5,given=7,goodRating1=median(getRatings(r0)))
## Timing stopped at: 0.003 0 0.004
## Warning in .local(x, method, ...):
## Recommender 'UBCF Pearson' has failed and has been removed from the results!
a=eval(r0[1:dim(r0)[1]],0.8,k1=5,given1=7,goodRating1=median(getRatings(r0)),"UBCF")
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 95 users.
## 24 x 195 rating matrix of class 'realRatingMatrix' with 3839 ratings.
##     RMSE       MSE      MAE
## 4.380680 19.190356 3.316556
b=round(as(a,"matrix")[1:10,1:10])
c <- as(b,"realRatingMatrix")
u=as(c,"data.frame")
names(u) =c("bowler","batsman","EconomyRate")
The code below computes the wickets taken by the bowler versus different batsmen
df3 <- select(df2, bowler1,batsman1,wicketTaken)
df6 <- xtabs(wicketTaken ~ ., df3)
df7 <- as.data.frame.matrix(df6)
df8 <- data.matrix(df7)
df8[df8 == 0] <- NA
r <- as(df8,"realRatingMatrix")
r0 <- r[(rowCounts(r) > 10),]
getRatingMatrix(r0)[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'A Badoni', 'A Manohar', 'A Nortje' ... ]]
##
## A Mishra       . . . . . . 1 . . .
## A Nortje       . . . 4 . . . . . .
## A Zampa        . . . 3 . . . . . .
## AD Russell     . . . 3 . . . . . .
## AJ Tye         . . . 3 . . 6 . . .
## AR Patel       . . . 4 . 1 3 . 1 1
## Arshdeep Singh . . . 3 . . 3 . . .
## AS Rajpoot     . . . . . . 3 . . .
## Avesh Khan     . . . . . . 1 . 3 .
## B Kumar        . . . 9 . . 3 . 1 .
summary(getRatings(r0))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.423 3.000 21.000
evalRecomMethods(r0[1:dim(r0)[1]],k1=5,given=7,goodRating1=median(getRatings(r0)))
## Timing stopped at: 0.003 0 0.003
## Warning in .local(x, method, ...):
## Recommender 'UBCF Pearson' has failed and has been removed from the results!
a = eval(r0[1:dim(r0)[1]], 0.8, k1 = 5, given1 = 7, goodRating1 = median(getRatings(r0)), "UBCF")
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 64 users.
## 16 x 195 rating matrix of class 'realRatingMatrix' with 1908 ratings.
## RMSE MSE MAE
## 2.672677 7.143203 1.956934
b = round(as(a, "matrix")[1:10, 1:10])
c <- as(b, "realRatingMatrix")
v = as(c, "data.frame")
names(v) = c("bowler", "batsman", "WicketTaken")
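The long-to-wide reshaping step used in both pipelines above (xtabs() followed by data.matrix(), with zeros treated as missing) can be illustrated on a tiny toy dataset. The names and values below are made up purely for illustration:

```r
# Toy version of the reshaping step: long (bowler, batsman, value) triples
# become a bowler-by-batsman matrix, with 0 meaning "never faced" -> NA.
df3 <- data.frame(
  bowler  = c("A Mishra", "A Mishra", "B Kumar"),
  batsman = c("V Kohli", "MS Dhoni", "V Kohli"),
  ER      = c(7.0, 9.5, 6.2)
)
df6 <- xtabs(ER ~ ., df3)                 # contingency table: bowlers x batsmen
df8 <- data.matrix(as.data.frame.matrix(df6))
df8[df8 == 0] <- NA                       # 0 is a missing pairing, not a rating
df8
```

The resulting matrix is exactly the shape `as(df8, "realRatingMatrix")` expects, with NA marking unobserved bowler-batsman pairs.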
The entire dataframe is regenerated with known and ‘predicted’ values
r1 = merge(s, t, by = c("bowler", "batsman"))
r2 = merge(r1, u, by = c("bowler", "batsman"))
r3 = merge(r2, v, by = c("bowler", "batsman"))
r4 = select(r3, bowler, batsman, BallsBowled, RunsConceded, EconomyRate, WicketTaken)
head(r4)
## bowler batsman BallsBowled RunsConceded EconomyRate WicketTaken
## 1 A Mishra AB de Villiers 102 144 8 4
## 2 A Mishra Abdul Samad 13 20 7 4
## 3 A Mishra Abhishek Sharma 14 26 8 2
## 4 A Mishra AD Russell 47 85 9 3
## 5 A Mishra AJ Finch 45 61 11 4
## 6 A Mishra AJ Tye 14 20 5 4
This post showed an approach to estimating batsman and bowler performance. The performance of the recommender engine could have been better. In any case, I think this approach will work for player estimation, provided the recommender algorithm achieves a high degree of accuracy. It would be a good way to estimate performance, since the algorithm can pick up features and nuances of batsmen and bowlers that cannot be captured by summary statistics alone.
To see all posts click Index of posts
The post Change ggplot2 Theme Color in R ggthemr Package appeared first on Data Science Tutorials.
In this article, you’ll learn how to change the colors of ggplot2 visuals in the R programming language by utilizing different themes.
Let’s start by making some sample data in R.
set.seed(123)
df <- data.frame(x = rnorm(100),
                 y = rnorm(100),
                 group = paste0("group_", LETTERS[1:5]))
head(df)
## x y group
## 1 0.7842588 0.3663221 group_A
## 2 1.2604525 0.4016048 group_B
## 3 1.0959772 0.5219341 group_C
## 4 2.4218639 1.1077222 group_D
## 5 0.7651984 0.7216053 group_E
## 6 0.5019940 1.3688690 group_A
The head of our sample data is shown above. It consists of three variables: x and y are numeric, while group is a character variable.
To use the ggplot2 package to visualize our data, we must first install and import it into R.
install.packages("ggplot2") library("ggplot2")
Then, using the R syntax below, we can make a ggplot2 scatterplot of our data.
plot <- ggplot(df, aes(x, y, col = group)) + geom_point()
plot
By using the preceding R syntax, we were able to produce a ggplot2 plot with default color parameters, as seen in Figure 1.
The following R code shows how to change the colors of our graph using the ggplot2 package’s predefined themes.
We can utilize the several predefined theme functions offered by the ggplot2 package for this task, as illustrated below.
plot_bw <- plot + theme_bw() + ggtitle("theme_bw()")
plot_classic <- plot + theme_classic() + ggtitle("theme_classic()")
plot_dark <- plot + theme_dark() + ggtitle("theme_dark()")
plot_gray <- plot + theme_gray() + ggtitle("theme_gray()")
plot_linedraw <- plot + theme_linedraw() + ggtitle("theme_linedraw()")
plot_light <- plot + theme_light() + ggtitle("theme_light()")
plot_minimal <- plot + theme_minimal() + ggtitle("theme_minimal()")
plot_test <- plot + theme_test() + ggtitle("theme_test()")
plot_void <- plot + theme_void() + ggtitle("theme_void()")
The previous R code generated nine ggplot2 charts, each with a unique theme design.
The patchwork package can then be used to create a comparison grid of these plots. We’ll need to install and load the patchwork package first.
install.packages("patchwork") library("patchwork")
Now we can arrange our nine plots in a grid as seen below:
(plot_bw + plot_classic + plot_dark) / (plot_gray + plot_linedraw + plot_light) / (plot_minimal + plot_test + plot_void)
The previous diagram depicts the many themes and their various color schemes.
As you can see, the majority of these themes are rather straightforward. Let’s make our plots more colorful!
Using the ggthemr package, change the color scheme of a ggplot2 plot.
The ggthemr package by Ciarán Tobin, which is now maintained by Mikata Project, is a powerful tool for designing, styling, and coloring ggplot2 plots.
The following R code can be used to install and load the package.
library("devtools")
devtools::install_github("Mikata-Project/ggthemr")
library("ggthemr")
The ggthemr function can now be used to modify the ggplot2 theme layout. The R code and graphical outputs below show how to change the theme of our example plot to suit other styles.
ggthemr("flat") plot
ggthemr("flat dark") plot
ggthemr("camouflage") plot
ggthemr("chalk") plot
ggthemr("copper") plot
ggthemr("dust") plot
ggthemr("earth") plot
ggthemr("fresh") plot
ggthemr("grape") plot
ggthemr("grass") plot
ggthemr("greyscale") plot
ggthemr("light") plot
ggthemr("lilac") plot
ggthemr("pale") plot
ggthemr("sea") plot
ggthemr("sky") plot
ggthemr("solarized") plot
As you can see, the ggthemr package comes with a wide range of predefined themes.
You can also manually develop your own ggplot2 theme if you want complete control over your ggplot2 themes. More information about custom themes may be found here.
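For example, a minimal custom theme can be built on top of one of the predefined themes. The styling choices below are assumptions for illustration only, not recommendations from the post:

```r
# A minimal custom ggplot2 theme built on theme_minimal()
# (bold title, bottom legend, no minor grid lines: illustrative choices).
library(ggplot2)

theme_custom <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title       = element_text(face = "bold"),
      legend.position  = "bottom",
      panel.grid.minor = element_blank()
    )
}

# Apply it like any other theme:
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  ggtitle("Custom theme") +
  theme_custom()
```

Because `theme_custom()` returns a regular theme object, it composes with ggtitle(), facets, and everything else shown above.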
The post Change ggplot2 Theme Color in R ggthemr Package appeared first on Data Science Tutorials
The post Best Books to Learn R Programming appeared first on Data Science Tutorials.
Best Books to Learn R Programming, R and Python are now the most popular programming languages for performing data science, and each has its own set of advantages and downsides.
Python is preferred for Machine Learning because of its production-ready architecture and the ease with which data analysis can be integrated with websites.
By virtue of its inherent statistical nature, data miners and statisticians favor R for constructing statistical computing tools. R is both a programming language and a platform for statistical programming.
In this post, we’ll look at the finest R books for realizing your ambition of working in the lucrative profession of data science, or for improving your skills if you already have one.
Before we begin, keep in mind that this is a list of the ten finest R books in general; we are not comparing these books to one another.
The book listed first does not have to be better than the books listed second and third. They are all deserving of inclusion on the list, in our opinion.
1. R in Action
Amazon – 4.5/5(199 ratings)
Dr. Robert L. Kabacoff has written several wonderful publications on R, including R in Action.
The second edition of the book provides users with thorough real-world examples from business, science, and technology.
The R in Action book includes a crash course in statistics in addition to real-world data science challenges and practical R-based solutions.
This basically entails describing effective strategies for making sense of incomplete, ambiguous, and massive amounts of data.
R in Action also shows how to use R’s graphical features to explore, manage, and solve data visualization problems.
More chapters on data mining, dynamic report creation, and forecasting have been added to the newest edition of the R in Action book.
Topics covered
Basics of ggplot2, Data mining, Data visualization, EDA (Exploratory Data Analysis), Graphics in R, Machine learning models.
Link:https://amzn.to/3wb2H8k
2. R for Data Science
Amazon – 4.7 (1179 ratings)
Hadley Wickham is a well-known author who focuses on the inner (and outer) workings of the R programming language, as well as data science.
The R for Data Science is a full package for readers who want to examine and assimilate both data science and R in one sitting.
The R for Data Science book starts with gaining a comprehensive understanding of data science, its implementation, and the science behind it.
The book picks up the pace of exploiting the R platform for diverse data science activities and operations as early as the first few chapters.
Garrett Grolemund, the book’s (credited) principal author and an RStudio Master Instructor, takes on the task of explaining the practical, real-world implementation of the synergy between R and data science in a way that is both entertaining and compelling, encouraging readers to do and learn more.
Topics covered
Data wrangling, Data visualization, Exploratory data analysis, Fundamentals of R, Fundamentals of data science, Implementation of R, and data science.
Link:https://amzn.to/37NinFq
3. The Art of R Programming – A Tour of Statistical Software Design
Amazon – 4.4/5 (246 ratings)
The Art of R Programming by Norman Matloff is another book that deserves to be included among the top R books.
The book’s author is a professor of computer science at the University of California and the designer of several popular software programs. Learning from him is certainly beneficial.
The Art of R Programming does not require any statistical knowledge and may be used even if you have only a basic understanding of programming.
As a result, it is ideal for novices. The R Programming book provides a thorough overview of R programming.
The Art of R Programming also covers object-oriented and functional programming paradigms, sophisticated data rearrangement, and executing mathematical simulations, in addition to R and software development.
Topics covered
Complex functions, Data visualization, Fundamentals of R, Statistical Programming, and Statistical software development.
Link:https://amzn.to/3FRKvnl
4. Hands-On Programming with R: Write Your Own Functions and Simulations
Amazon – 4.4/5 (237 ratings)
The book Hands-On Programming with R explains how to build and disassemble data objects, load data, navigate the R environment, use R’s tools, and create user-defined functions. The book accomplishes this with simple language.
To make learning enjoyable, three practical data analysis tasks inspired by casino games are included in the Hands-On Programming with R book.
Each one provides detailed examples that demonstrate a variety of R programming skills, including data visualization and modeling.
Garrett Grolemund, the RStudio Master Instructor and co-author of another excellent book based on the R platform, R for Data Science, wrote Hands-On Programming with R.
Aside from R, the instructor utilizes the book to teach readers about data science and programming in general.
Topics covered
Basics of R, Data modeling, Data visualization, Fundamentals of data science, Complementary R tools and software.
Link:https://amzn.to/3yHqtup
5. R Graphics Cookbook: Practical Recipes for Visualizing Data
Amazon – 4.6/5 (72 ratings)
The R Graphics Cookbook is a wonderful read for anyone looking for a book that graphically explains R while remaining focused on its graphics features.
It includes 150+ images, known as recipes, for easily creating high-quality graphics with the R platform.
Each recipe addresses a specific issue and provides a detailed solution. The why and how of the so-called recipe are also thoroughly discussed in order for readers to gain a thorough comprehension of the relevant topics.
The ggplot2 package is used in the majority of examples.
Winston Chang, a software engineer at RStudio, wrote the R Graphics Cookbook.
The R Graphics Cookbook is, interestingly, an updated version of the author’s previous endeavor, the Cookbook for R.
It was a website that detailed scripts for effectively performing common R tasks.
Topics covered
Data visualization, Graphics in R, Solutions to common/redundant tasks, and The visual design of graphics.
Link:https://amzn.to/3LhBCo6
6. R Packages: Organize, Test, Document, and Share Your Code
Amazon – 4.7/5 (70 ratings)
R Packages: Organize, Test, Document, and Share Your Code is designed for readers who want a thorough understanding of R packages.
The book not only explains the fundamental concepts of R packages but also walks you through the process of building and sharing your own.
The book on R programming teaches users how to use two of the most popular R development tools, devtools and roxygen2.
The readers will test, investigate, and build a knowledge of how R packages automate common development chores throughout the R Packages book.
Hadley Wickham, an Adjunct Professor of Statistics at Rice University, a prominent author (who wrote R for Data Science), and the Chief Scientist at RStudio, wrote R Packages.
Wickham has made numerous contributions to the development and improvement of the R platform, as well as teaching the complexity of R with equal ease as the fundamentals.
Topics covered
Basics of R programming, R packages: working, developing, implementation, and optimization, Adding documentation, Reusability of R functions, Uploading R packages, Data sampling, devtools and roxygen2, The importance of documentation in R.
Link:https://amzn.to/3NeiRU5
7. Practical Data Science with R
Amazon – 4.3/5 (24 ratings)
Manning Publications is wellknown for publishing books about programming and associated technology.
Practical Data Science with R, published by the publishing behemoth, covers not only the popular data science platform, R but also the discipline of data science.
Nina Zumel and John Mount’s book, Practical Data Science with R, helps readers get a solid understanding of the practical applications of data science and how R may assist them.
It also illustrates the statistical tools needed to solve complicated business problems in an elegant manner.
The book Practical Data Science with R is jam-packed with examples from business intelligence (BI), decision-making, and marketing.
These are used to demonstrate the process of constructing predictive models, designing appropriate tests, and customizing outcomes to a wide range of audiences, including both professionals and beginners.
Topics covered
Basics of data science, Fundamentals of R programming, Graphics in R, Implementation of the R platform, Predictive modeling.
Link:https://amzn.to/3LesgJN
8. R for Everyone: Advanced Analytics and Graphics
Amazon – 4.4/5 (198 ratings)
R for Everyone is an R reference book for everyone, as the title says. It begins with the fundamentals of the R programming language and progresses to more sophisticated R activities such as adding rich documentation, performing advanced analytics, and creating your own packages.
It does so in 30 self-contained chapters chock-full of detailed, hands-on code examples.
R for Everyone makes it simple for pros and even beginners to get started with R, even if they have never worked with statistical programming before.
This is something that the book’s author, Jared P. Lander, a renowned data scientist, has been doing for a long time in his career. As a result, this book is simply an extension of his easy-to-understand teaching style.
The R book includes everything you need to know about R, from installing and configuring the environment to writing your own R packages. All you need is the desire to succeed.
Topics covered
Basics of R, Basics of statistics, Data modeling, Data visualization, Interactive dashboard using Shiny, R packages development, Rich documents using RMarkdown, Statistical programming.
Link:https://amzn.to/3wqsSqo
9. The Book of R: A First Course in Programming and Statistics
Amazon – 4.6/5 (252 ratings)
One of the most beginner-friendly R books available is The Book of R: A First Course in Programming and Statistics.
The Book of R requires little more from readers than a basic understanding of mathematics and a commitment to learning R.
The book is lengthy not because of the material, but because of the numerous, comprehensive examples.
The R book is chock-full of detailed examples that will assist the reader in better understanding R’s ideas.
The Book of R is a remarkable example of how combining extensive topic explanations with rich, real-world examples dramatically improves the degree and ease of learning.
The Book of R covers constructing and conducting optimum statistical tests and models, producing statistical summaries, and creating publicationready graphics in the advanced R portion.
Tilman M. Davies, a Ph.D. holder and statistics lecturer at the University of Otago (New Zealand), wrote the R book. The book was inspired by the author’s annual three-day program, Introduction to R.
Topics covered
Basics of Statistics, Data visualization using ggplot2, ggvis, and rgl packages, Fundamentals of R, Graphics in R, Implementation of the R platform.
Link:https://amzn.to/3Lecezy
10. The R Book
Amazon – 4.4/5 (125 ratings)
The R Book uses full-color text and comprehensive images to teach learners everything they need to know about the R platform, from the basics to advanced topics like designing R-based solutions to complicated data science problems.
Aside from the wide range of topics covered by The R Book, the book also includes a review of R’s evolution over the five years preceding its publication. A new chapter on Bayesian Analysis and Meta-Analysis is included in the new edition.
Michael J. Crawley, an FRS in the Department of Biological Sciences at Imperial College of Science, Technology, and Medicine, wrote The R Book.
The author has extensive experience in generating interest in data science, the R platform, and the solution to complicated, realworld problems.
Topics covered
Basics of R, Bayesian Analysis and Meta-Analysis, Data science fundamentals, Statistical programming, The evolution of the R platform.
Link:https://amzn.to/3sBS2kJ
That concludes our list of the top ten R programming books. Regardless of where you stand on the R programming language proficiency scale, you’ll find one or more of these ten books useful for your future R and data science activities.
Which R books are your favorites?
Share your thoughts in the comments area below
The post Best Books to Learn R Programming appeared first on Data Science Tutorials
Hey kid, fancy some self-documenting {ggplots}
like this one:
Just read on!
I’ve been working hard on a package that I’ve called {chronicler}
(read my post on it
here) which allows you to
attach a log to the objects you create, thus making it easy to know how some data (for example)
has been created. Here’s a quick example and intro to the main features:
suppressPackageStartupMessages(
  library(dplyr)
)
library(chronicler)

# record() decorates functions so they provide enriched output
r_group_by <- record(group_by)
r_select <- record(select)
r_summarise <- record(summarise)
r_filter <- record(filter)

output_pipe <- starwars %>%
  r_select(height, mass, species, sex) %>=% # <- this is a special pipe operator to handle `chronicle` objects
  r_group_by(species, sex) %>=%
  r_filter(sex != "male") %>=%
  r_summarise(mass = mean(mass, na.rm = TRUE))
output_pipe
not only has the result of all the {dplyr}
operations, but also carries a log
with it. Let’s print the object:
output_pipe
## OK! Value computed successfully:
## ---------------
## Just
## # A tibble: 9 × 3
## # Groups: species [9]
##   species    sex            mass
##   <chr>      <chr>         <dbl>
## 1 Clawdite   female         55
## 2 Droid      none           69.8
## 3 Human      female         56.3
## 4 Hutt       hermaphroditic 1358
## 5 Kaminoan   female         NaN
## 6 Mirialan   female         53.1
## 7 Tholothian female         50
## 8 Togruta    female         57
## 9 Twi'lek    female         55
##
## ---------------
## This is an object of type `chronicle`.
## Retrieve the value of this object with pick(.c, "value").
## To read the log of this object, call read_log(.c).
Accessing the value is possible with pick("value")
:
pick(output_pipe, "value")
## # A tibble: 9 × 3
## # Groups: species [9]
##   species    sex            mass
##   <chr>      <chr>         <dbl>
## 1 Clawdite   female         55
## 2 Droid      none           69.8
## 3 Human      female         56.3
## 4 Hutt       hermaphroditic 1358
## 5 Kaminoan   female         NaN
## 6 Mirialan   female         53.1
## 7 Tholothian female         50
## 8 Togruta    female         57
## 9 Twi'lek    female         55
and you can read the log with read_log()
:
read_log(output_pipe)
## [1] "Complete log:"
## [2] "OK! select(height,mass,species,sex) ran successfully at 2022-05-15 17:10:43"
## [3] "OK! group_by(species,sex) ran successfully at 2022-05-15 17:10:43"
## [4] "OK! filter(sex != \"male\") ran successfully at 2022-05-15 17:10:43"
## [5] "OK! summarise(mean(mass, na.rm = TRUE)) ran successfully at 2022-05-15 17:10:43"
## [6] "Total running time: 0.0434844493865967 secs"
If you want to understand how this works, I suggest you read the blog post I linked above but also
this one, where I explain the nitty gritty,
theoretical details behind what {chronicler}
does. To make a long story short, {chronicler}
uses an advanced functional programming concept called a monad. And using the power of monads,
I can now make selfdocumenting {ggplot2}
graphs.
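To make the idea concrete, here is a stripped-down, base-R sketch of the value-plus-log pattern. This is illustrative only and is not {chronicler}'s actual implementation; the function names are made up:

```r
# A value wrapped together with its log (hypothetical helper names).
chronicle_val <- function(value, log = character(0)) {
  list(value = value, log = log)
}

# bind: apply f to the wrapped value and append a log entry.
bind_record <- function(.c, f, msg) {
  chronicle_val(f(.c$value), c(.c$log, msg))
}

out <- chronicle_val(c(4, 9, 16)) |>
  bind_record(sqrt, "sqrt() ran successfully") |>
  bind_record(sum,  "sum() ran successfully")

out$value  # 9
out$log    # the two log entries, in order
```

The monadic part is `bind_record()`: it knows how to unwrap the value, apply an ordinary function, and re-wrap the result while threading the log along, which is exactly what `%>=%` does for `chronicle` objects.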
The idea was to be able to build a plot in a way similar to how I built that dataset just above,
and have a log of how it was created attached to it. The issue is that the function that
transforms functions to chronicler
functions, record()
, does not work on {ggplot2}
functions.
This is because the way you create {ggplot2}
graphs is by adding layers on top of each other:
library(ggplot2) ggplot(mtcars) + geom_point(aes(mpg, hp))
The +
here acts as a way to “add” the geom_point(mpg, hp)
layer on top of the ggplot(mtcars)
layer.
I remember reading some tweets, quite some time ago, from people asking why %>%
couldn’t work with
{ggplot2}
and if Hadley Wickham, the developer of {ggplot2}
, was considering making something like this
work:
ggplot(mtcars) %>% geom_point(aes(mpg, hp))
because people kept forgetting to use +
and kept using %>%
. The thing is, %>%
and +
do very
different things. %>%
takes its first argument and passes it as the first argument of its second
argument, in other words this:
a %>% f(b)
is exactly the same as:
f(a, b)
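You can check this equivalence directly with base R's native pipe |>, which follows the same rewriting rule:

```r
# a |> f(b) is rewritten to f(a, b); the two calls give identical results.
res_pipe   <- mtcars |> head(2)
res_direct <- head(mtcars, 2)
identical(res_pipe, res_direct)  # TRUE
```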
This is not what {ggplot2}
functions do. When you call +
on {ggplot2}
objects, this is NOT what happens:
geom_point(ggplot(mtcars), aes(mpg, hp))
So that’s why %>%
cannot be used with {ggplot2}
functions, and that’s also why the functions I developed
in {chronicler}
could not handle {ggplot2}
functions either. So I had to provide new functions. The first
function I developed is called ggrecord()
and it decorates {ggplot2}
functions:
r_ggplot <- ggrecord(ggplot)
r_geom_point <- ggrecord(geom_point)
r_labs <- ggrecord(labs)
Now the output of these functions are not ggplot
objects anymore, but chronicle objects. So to make
layering them possible, I also needed to rewrite +
. I called my rewritten +
like this: %>+%
:
a <- r_ggplot(mtcars) %>+%
  r_geom_point(aes(y = mpg, x = hp)) %>+%
  r_labs(title = "Self-documenting ggplot!\nLook at the bottom right",
         caption = "This is an example caption")
Let’s first take a look at a
:
a
## OK! Ggplot computed successfully:
## ---------------
## Just
##
## ---------------
## This is an object of type `chronicle`.
## Retrieve the value of this object with pick(.c, "value").
## To read the log of this object, call read_log(.c).
As expected, a
is now an object of type {chronicle}
, where its “value” is a ggplot
object.
But where is the selfdocumenting part?
For this, you use the last piece of the puzzle, document_gg()
:
document_gg(a)
## OK! Ggplot computed successfully:
## ---------------
## Just
##
## ---------------
## This is an object of type `chronicle`.
## Retrieve the value of this object with pick(.c, "value").
## To read the log of this object, call read_log(.c).
The caption now contains the log of the plot, making it easily reproducible!
This is still in very early development, but if you want to try it out, you’ll need to try the dev
branch of the package.
Any feedback, comments, ideas, pull requests, more than welcome.
Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebook on Leanpub. You can also watch my videos on youtube. So much content for you to consoom!
Below is the 4th exercise in book Mastering Shiny, Chapter 4: Case study: ER injuries.
“Provide a way to step through every narrative systematically with forward and backward buttons. Advanced: Make the list of narratives “circular” so that advancing forward from the last narrative takes you to the first.”
Collin Berke provided a solution, where the %%
was used for the circular purpose. However, employing %%
is a little tricky in this exercise. Thus, I use if else
in both Next and Previous arrows. Here is the code.
library(shiny)
library(ggplot2)
library(vroom)
library(tidyverse)

# set your own directory
setwd('../../../data')
injuries <- vroom::vroom("neiss/injuries.tsv.gz")
population <- vroom::vroom("neiss/population.tsv")
products <- vroom::vroom("neiss/products.tsv")
prod_codes <- setNames(products$prod_code, products$title)

count_top <- function(df, var, n = 5) {
  df %>%
    mutate({{ var }} := fct_lump(fct_infreq({{ var }}), n = n)) %>%
    group_by({{ var }}) %>%
    summarise(n = as.integer(sum(weight)))
}

ui <- fluidPage(
  fluidRow(
    column(8,
      selectInput("code", "Product",
        choices = setNames(products$prod_code, products$title),
        width = "100%"
      )
    ),
    column(2, selectInput("y", "Y axis", c("rate", "count"))),
    column(2, numericInput("num_row", "Number of rows", 5, min = 1, max = 15))
  ),
  fluidRow(
    column(4, tableOutput("diag")),
    column(4, tableOutput("body_part")),
    column(4, tableOutput("location"))
  ),
  fluidRow(
    column(12, plotOutput("age_sex"))
  ),
  fluidRow(
    column(2, span("Narrative Display:", style = "font-weight: bold")),
    column(1, actionButton(inputId = "Previous", label = icon("arrow-left"))),
    column(1, actionButton(inputId = "Next", label = icon("arrow-right"))),
    column(8, textOutput("narrative"))
  )
)

server <- function(input, output, session) {
  # use the last 6 records to test narrative
  selected <- reactive(injuries %>% filter(prod_code == input$code) %>% slice_tail(n = 6))
  n_row <- reactive(input$num_row) # number of rows displayed in tables

  output$diag <- renderTable(count_top(selected(), diag, n = n_row()), width = "100%")
  output$body_part <- renderTable(count_top(selected(), body_part, n = n_row()), width = "100%")
  output$location <- renderTable(count_top(selected(), location, n = n_row()), width = "100%")

  summary <- reactive({
    selected() %>%
      count(age, sex, wt = weight) %>%
      left_join(population, by = c("age", "sex")) %>%
      mutate(rate = n / population * 1e4)
  })

  output$age_sex <- renderPlot({
    if (input$y == "count") {
      summary() %>%
        ggplot(aes(age, n, colour = sex)) +
        geom_line() +
        labs(y = "Estimated number of injuries")
    } else {
      summary() %>%
        ggplot(aes(age, rate, colour = sex)) +
        geom_line(na.rm = TRUE) +
        labs(y = "Injuries per 10,000 people")
    }
  }, res = 96)

  ### narrative
  num_narr <- reactive(length(selected()$narrative))
  # a reactive value that can be easily changed later (in events)
  # ref: https://stackoverflow.com/questions/42183161/r-shiny-how-to-change-values-in-a-reactivevalues-object
  i <- reactiveValues(tmp = 1)
  # reset i to 1 if code is changed by user
  # ref: https://www.collinberke.com/post/shiny-series-implementing-a-next-and-back-button/
  observeEvent(input$code, {
    i$tmp <- 1
  })
  output$narrative <- renderText({
    selected()$narrative[1]
  })
  observeEvent(input$Next, {
    i$tmp <- i$tmp + 1
    if (i$tmp <= num_narr()) {
      output$narrative <- renderText({ selected()$narrative[i$tmp] })
    } else {
      i$tmp <- 1
      output$narrative <- renderText({ selected()$narrative[1] })
    }
  })
  observeEvent(input$Previous, {
    i$tmp <- i$tmp - 1
    if (i$tmp > 0) {
      output$narrative <- renderText({ selected()$narrative[i$tmp] })
    } else {
      i$tmp <- num_narr()
      output$narrative <- renderText({ selected()$narrative[num_narr()] })
    }
  })
}

shinyApp(ui, server)
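The wrap-around logic inside the Next and Previous handlers can be isolated into a small helper, shown here next to the %% alternative that Collin Berke's solution relies on. Both function names are hypothetical, for illustration only:

```r
# Wrap-around stepping with if/else, mirroring the handlers above.
step_index <- function(i, n, direction = c("next", "previous")) {
  direction <- match.arg(direction)
  i <- if (direction == "next") i + 1 else i - 1
  if (i > n) 1 else if (i < 1) n else i
}

# Equivalent one-liner using modular arithmetic (the %% approach):
step_mod <- function(i, n, by = 1) ((i - 1 + by) %% n) + 1

step_index(6, 6, "next")   # wraps to 1
step_mod(1, 6, by = -1)    # wraps to 6
```

The shift by one in `step_mod()` is what makes %% "a little tricky" here: R indices start at 1, while `%%` naturally works on 0-based values.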
#========================================================#
# Quantitative ALM, Financial Econometrics & Derivatives
# ML/DL using R, Python, Tensorflow by Sang-Heon Lee
#
# https://kiandlee.blogspot.com
#--------------------------------------------------------#
# IRRBB Interest Rate Shock Scenarios
#========================================================#

graphics.off(); rm(list = ls())

#=======================================================
# read data
#=======================================================
str.zero <- "mat rate
0.083333333 0.01529
0.25 0.01648
0.5 0.01815
0.75 0.01947
1 0.02045
2 0.02252
3 0.02410
4 0.02549
5 0.02637
6 0.02730
7 0.02787
8 0.02825
9 0.02857
10 0.02881
12 0.02913
15 0.02953
20 0.03042"

df <- read.table(text = str.zero, header = TRUE)

#=======================================================
# make shocks
#=======================================================
# size of shocks by each currency
# USD parameter : BCBS (2016)
r_bar_p_c <- 0.02
r_bar_s_c <- 0.03
r_bar_l_c <- 0.015

# placeholder for rate shocks
dr <- data.frame(base = rep(0, length(df$mat)))

# Parallel shock up/down
dr$pu <- r_bar_p_c
dr$pd <- -r_bar_p_c

# Short rate shock up/down
dr$su <- r_bar_s_c * exp(-df$mat / 4)
dr$sd <- -r_bar_s_c * exp(-df$mat / 4)

# Long rate shock
drlong <- r_bar_l_c * (1 - exp(-df$mat / 4))

# Rotational shock
dr$steep <- -0.65 * dr$su + 0.9 * drlong
dr$flat <- 0.8 * dr$su - 0.6 * drlong

dr

#=======================================================
# generate shocked curves
#=======================================================
(df.irrbb <- 0.01 + dr)

x11(width = 5, height = 4)
matplot(df$mat, df.irrbb, type = "l", lty = 1, lwd = 3,
        xlab = "Maturity(year)", ylab = "Yield(decimal)",
        main = "IRRBB IR shock scenarios (Base = 1%)")

(df.irrbb <- df$rate + dr)

x11(width = 5, height = 4)
matplot(df$mat, df.irrbb, type = "l", lty = 1, lwd = 3,
        xlab = "Maturity(year)", ylab = "Yield(decimal)",
        main = "IRRBB IR shock scenarios")
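As a quick sanity check, the shock formulas can be evaluated at a single maturity (t = 1 year) using the USD shock sizes from the script. The variable names below are local to this snippet:

```r
# Evaluate the shock formulas at t = 1 year with the USD BCBS (2016) sizes.
r_bar_s <- 0.03
r_bar_l <- 0.015
t <- 1

su    <- r_bar_s * exp(-t / 4)        # short-rate shock up (decays with maturity)
long  <- r_bar_l * (1 - exp(-t / 4))  # long-rate shock (grows with maturity)
steep <- -0.65 * su + 0.9 * long      # steepener: short end down, long end up
flat  <-  0.8 * su - 0.6 * long       # flattener: short end up, long end down

round(c(su = su, long = long, steep = steep, flat = flat), 5)
```

At short maturities the exponential terms make the steepener negative and the flattener positive, which is the expected rotation of the curve around its middle.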
I’m happy to announce that survivoR v1.0 is now on CRAN. The package now contains all the features intended for the first major release. A big thank you to Carly Levitz for helping collate and test the data.
This post details the major updates since v0.9.12. For a complete list of tables and features of the package please visit the Github page.
To jump right into it you can install the package with
install.packages("survivoR")
Or from Git with
devtools::install_github("doehm/survivoR")
If you find any issues, please raise them on Github and I’ll correct them asap. For updates, feel free to follow myself and Carly on Twitter.
This release features new datasets, additional fields on existing tables, and the removal of unused or redundant features.
New tables:
advantage_details – Details of each advantage found and used across all seasons
advantage_movement – Details the movement of each advantage and when each advantage is played, including hidden immunity idols
boot_mapping – A mapping table for the stage of the game referenced by the number of boots there have been

New features:
vote_history

tribe – The name of the tribe that attended Tribal Council.
vote_event – Identifies other events that can occur at Tribal Council, e.g. a castaway played the Shot-in-the-Dark.
split_vote – If there was a split vote orchestrated to flush an idol, this identifies who the votes were split across and who was involved with the strategy.

challenge_results

order – The boot order references how many boots there have been in the game so far. This maps to the boot_mapping table.

viewers

imdb_rating – The IMDb rating for the episode. Given these are user ratings, they may change over time. With each new release the ratings will be updated, although only minor changes are expected for the most recent season.

Removed features:
castaways
swapped_tribe
swapped_tribe_2
merged_tribe
total_votes_received
immunity_idols_won
Updates:
- confessionals – Double episodes have been collapsed to ensure alignment with episodes on all other tables. This will impact mean confessionals-per-episode calculations but gives a more consistent and convenient structure. Recap episodes are also accounted for.

All advantages and hidden immunity idols found across all seasons are captured in these two tables. The tables map to each other by advantage_id and detail the life of each advantage in tidy format.
This dataset lists the hidden idols and advantages in the game for all seasons. It details where each was found, whether there was a clue to the advantage, the location, and other advantage conditions. It maps to the advantage_movement table.
advantage_details %>%
  filter(season == 41)
# A tibble: 9 x 9
  version version_season season_name  season advantage_id advantage_type  clue_details  location_found conditions
  <chr>   <chr>          <chr>         <dbl> <chr>        <chr>           <chr>         <chr>          <chr>
1 US      US41           Survivor: 41     41 USEV4101     Extra vote      No clue exis~ Shipwheel Isl~ NA
2 US      US41           Survivor: 41     41 USEV4102     Extra vote      No clue exis~ Shipwheel Isl~ NA
3 US      US41           Survivor: 41     41 USEV4103     Extra vote      No clue exis~ Shipwheel Isl~ NA
4 US      US41           Survivor: 41     41 USHI4101     Hidden immunit~ Found withou~ Found around ~ Beware advantage: all~
5 US      US41           Survivor: 41     41 USHI4102     Hidden immunit~ Found withou~ Found around ~ Beware advantage: all~
6 US      US41           Survivor: 41     41 USHI4103     Hidden immunit~ Found withou~ Found around ~ Beware advantage: all~
7 US      US41           Survivor: 41     41 USHI4104     Hidden immunit~ Found withou~ Found around ~ Beware advantage: all~
8 US      US41           Survivor: 41     41 USKP4101     Knowledge is p~ No clue exis~ Found around ~ NA
9 US      US41           Survivor: 41     41 USVS4101     Steal a vote    No clue exis~ Shipwheel Isl~ NA
The advantage_movement table tracks who found the advantage, who they may have handed it to, and who they played it for. Each step is considered an event. The sequence_id tracks the logical steps of the advantage. For example, in season 41, JD found an extra vote advantage. JD gave it to Shan in good faith, who then voted him out, keeping the extra vote. Shan gave it to Ricard in good faith, who eventually gave it back before Shan played it for Naseer. That movement is recorded in this table.
The table also records who the advantage was eventually played for and whether it was successful or not needed, as well as the unfortunate situations when someone is blindsided and voted out while holding the advantage.
advantage_movement %>%
  filter(advantage_id == "USEV4102")
# A tibble: 5 x 15
  version version_season season_name  season castaway castaway_id advantage_id sequence_id   day episode event    played_for played_for_id success votes_nullified
  <chr>   <chr>          <chr>         <dbl> <chr>    <chr>       <chr>              <dbl> <dbl>   <dbl> <chr>    <chr>      <chr>         <chr>             <dbl>
1 US      US41           Survivor: 41     41 JD       US0603      USEV4102               1     2       1 Found    NA         NA            NA                   NA
2 US      US41           Survivor: 41     41 Shan     US0606      USEV4102               2     9       4 Received NA         NA            NA                   NA
3 US      US41           Survivor: 41     41 Ricard   US0596      USEV4102               3     9       4 Received NA         NA            NA                   NA
4 US      US41           Survivor: 41     41 Shan     US0606      USEV4102               4    11       5 Received NA         NA            NA                   NA
5 US      US41           Survivor: 41     41 Shan     US0606      USEV4102               5    17       9 Played   Naseer     US0600        Yes                  NA
The boot_mapping table makes it easy to filter to the set of castaways still in the game after a specified number of boots. It differs from the tribe mapping in that, rather than being focused on an episode, it is focused on the boot, which is often more useful. The number of boots and who is left in the game is often a better indicator of the stage of the game than the episode or day. When someone quits the game or is medically evacuated, it is counted as a boot. This table tracks multiple boots per episode.
In the case of double Tribal Councils there is an order in which castaways have their torches snuffed. This is also captured, even though it means there is a set of players still remaining for literally minutes before the next one leaves the game.
If you needed to determine who is left in the game of season 41 after 12 boots (12 people have either been voted off or left the game) you can use the following code.
boot_mapping %>%
  filter(
    season == 41,
    order == 12
  )
# A tibble: 6 x 11
  version version_season season_name  season episode order castaway castaway_id tribe    tribe_status in_the_game
  <chr>   <chr>          <chr>         <dbl>   <dbl> <dbl> <chr>    <chr>       <chr>    <chr>        <lgl>
1 US      US41           Survivor: 41     41      12    12 Heather  US0593      Via Kana Merged       TRUE
2 US      US41           Survivor: 41     41      12    12 Erika    US0594      Via Kana Merged       TRUE
3 US      US41           Survivor: 41     41      12    12 Ricard   US0596      Via Kana Merged       TRUE
4 US      US41           Survivor: 41     41      12    12 Xander   US0597      Via Kana Merged       TRUE
5 US      US41           Survivor: 41     41      12    12 Danny    US0599      Via Kana Merged       TRUE
6 US      US41           Survivor: 41     41      12    12 Deshawn  US0601      Via Kana Merged       TRUE
As an example, the boot_mapping table can be used to calculate how many people, and who, participated in certain challenges once mapped to challenge_results.
df_challenges <- challenge_results %>%
  unnest(winners) %>%
  filter(
    season == 41,
    order == 4,
    outcome_status == "Winner"
  ) %>%
  count(season, episode, order, challenge_type, name = "n_winners")

boot_mapping %>%
  filter(
    season == 41,
    order == 4
  ) %>%
  count(season, episode, order, name = "n_challengers") %>%
  left_join(df_challenges, by = c("season", "episode", "order"))
This table comes in handy for many types of analysis. Please see the documentation for detailed descriptions of the fields.
As a reminder, the package also includes ggplot fill and colour scales based on the season logos and tribe colours. The season 42 logo and tribe colour palettes have been added. To use the colours from a particular season, simply use scale_*_survivor(<season number>) or scale_*_tribes(<season number>).
library(survivoR)
library(tidyverse)

df_results <- castaways %>%
  mutate(
    result = case_when(
      str_detect(result, "Sole") ~ "Sole Survivor",
      str_detect(result, "unner") ~ "Finalist",
      str_detect(jury_status, "jury") ~ "Jury",
      TRUE ~ "Other"
    ),
    result = factor(result, levels = c("Sole Survivor", "Finalist", "Jury", "Other"))
  ) %>%
  distinct(version_season, castaway_id, result)

vote_history %>%
  filter(!is.na(vote_id)) %>%
  left_join(df_results, by = c("version_season", "vote_id" = "castaway_id")) %>%
  count(order, result) %>%
  ggplot(aes(order, n, fill = result)) +
  geom_col() +
  scale_x_continuous(breaks = 1:20, labels = 1:20) +
  labs(
    title = "Total number of votes across 42 seasons",
    subtitle = "Distribution of votes by boot order and result",
    x = "Boot order",
    y = "Number of votes received",
    fill = "Result"
  ) +
  scale_fill_survivor(42) +
  theme_minimal()
confessionals %>%
  left_join(df_results, by = c("version_season", "castaway_id")) %>%
  group_by(episode, result) %>%
  summarise(n = sum(confessional_count)) %>%
  ggplot(aes(episode, n, fill = result)) +
  geom_col() +
  scale_x_continuous(breaks = 1:16, labels = 1:16) +
  labs(
    title = "Total number of confessionals across 42 seasons",
    subtitle = "Distribution of confessionals by episode and result",
    x = "Episode",
    y = "Number of confessionals",
    fill = "Result"
  ) +
  scale_fill_tribes(16, reverse = TRUE) +
  theme_minimal()
The post survivoR v1.0 is now on CRAN appeared first on Dan Oehm – Gradient Descending.
The post How to make a rounded corner bar plot in R? appeared first on
Rounded corner bar plot in R: we'll show you how to use the ggchicklet package in the R programming language to make a ggplot2 bar chart with rounded bars.
The ggchicklet Package: An Overview
Bob Rudis’ ggchicklet package includes utilities for creating rounded rectangle segmented column charts (often known as “chicklets”).
Let’s dive into the topic.
We'll teach you how to make rounded-corner barplots in R using the ggplot2 and ggchicklet packages.
We must first create some data for this example.
df <- data.frame(value = 1:5,
                 group = LETTERS[1:5])
df
  value group
1     1     A
2     2     B
3     3     C
4     4     D
5     5     E
We constructed a data frame with a numeric and a character column by running the prior code.
The ggplot2 package must then be loaded.
install.packages("ggplot2") library("ggplot2")
We can now use the geom_col function to create a conventional ggplot2 barplot, as illustrated below.
ggplot(df, aes(group, value)) +
  geom_col()
The barchart depicted in Figure 1 was constructed after executing the previous R programming code.
The corners of the bars are not yet rounded, as you can see.
We must also install and load the ggchicklet package in order to produce a bar graph with round corners.
install.packages("ggchicklet", repos = "https://cinc.rud.is") library("ggchicklet")
We can now construct a barplot with round corners using the ggchicklet package's geom_chicklet function. The radius argument of geom_chicklet can be used to indicate how round the corners should be.
ggplot(df, aes(group, value)) +
  geom_chicklet(radius = grid::unit(3, "mm"))
As you can see, we’ve made another ggplot2 barplot, but this time we’ve rounded the edges of the bars.
When developing stacked barplots, we think rounded corners look particularly pretty.
To demonstrate this, we must first construct a new data set.
df2 <- data.frame(value = 1:16,
                  group = rep(LETTERS[1:4], each = 4),
                  sub = letters[1:4])
df2
   value group sub
1      1     A   a
2      2     A   b
3      3     A   c
4      4     A   d
5      5     B   a
6      6     B   b
The outcome of the previous code is shown above: A data frame with group and subgroup columns, as well as a values column.
The geom_chicklet function can then be used to create a stacked barplot of these data. Note that the fill parameter is also specified within the aesthetics of our plot.
ggplot(df2, aes(group, value, fill = sub)) + geom_chicklet(radius = grid::unit(3, "mm"))
A few days ago, I released {ggblanket} onto CRAN.
This package took over my brain for 1.5 months, and I worked obsessively on it. I hope people find it useful.
The objective of {ggblanket} is to make beautiful {ggplot2} visualisation simpler.
With this objective in mind, the {ggblanket} package:
library(dplyr)
library(ggplot2)
library(ggblanket)

penguins2 <- palmerpenguins::penguins %>%
  tidyr::drop_na() %>%
  mutate(body_mass_kg = body_mass_g / 1000)

penguins2 %>%
  ggplot() +
  geom_histogram(aes(x = body_mass_kg))
penguins2 %>% gg_histogram(x = body_mass_kg)
penguins2 %>%
  group_by(species, sex, island) %>%
  summarise(body_mass_kg = mean(body_mass_kg)) %>%
  ggplot() +
  geom_col(
    aes(x = body_mass_kg, y = species, fill = sex),
    position = "dodge"
  ) +
  facet_wrap(~ island) +
  theme(legend.position = "bottom")
penguins2 %>%
  group_by(species, sex, island) %>%
  summarise(body_mass_kg = mean(body_mass_kg)) %>%
  gg_col(
    x = body_mass_kg,
    y = species,
    col = sex,
    facet = island,
    position = "dodge",
    col_legend_place = "b"
  )
Other examples
storms %>%
  group_by(year) %>%
  summarise(wind = mean(wind, na.rm = TRUE)) %>%
  gg_line(
    x = year,
    y = wind,
    y_zero = TRUE,
    x_labels = ~.x,
    title = "Storm wind speed",
    subtitle = "USA average storm wind speed, 1975\u20132020",
    y_title = "Wind speed (knots)",
    caption = "Source: NOAA",
    theme = gg_theme(y_grid = TRUE)
  ) +
  geom_point()
penguins2 %>%
  gg_density(
    x = body_mass_kg,
    col = species,
    facet = sex,
    col_legend_place = "b")

penguins2 %>%
  gg_jitter(
    x = species,
    y = body_mass_g,
    col = flipper_length_mm,
    col_intervals = ~santoku::chop_quantiles(.x, probs = seq(0, 1, 0.25)),
    position = position_jitter(width = 0.2, height = 0, seed = 123),
    y_zero = TRUE)

penguins2 %>%
  gg_smooth(
    x = bill_length_mm,
    y = flipper_length_mm,
    col = species)

penguins2 %>%
  gg_histogram(
    x = body_mass_kg,
    col = species,
    facet = sex,
    col_legend_place = "b",
    pal = pals::brewer.dark2(3))
df <- data.frame(
  trt = factor(c(1, 1, 2, 2)),
  resp = c(1, 5, 3, 4),
  group = factor(c(1, 2, 1, 2)),
  upper = c(1.1, 5.3, 3.3, 4.2),
  lower = c(0.8, 4.6, 2.4, 3.6)
)

dodger <- position_dodge(width = 0.75)

gg_blank(df,
         x = resp, xmin = lower, xmax = upper,
         y = trt, col = group) +
  geom_col(position = dodger, width = 0.75, alpha = 0.9) +
  geom_errorbar(position = dodger, width = 0.2, col = "#232323")
For further information, see the ggblanket website.
I'm still working on my package RCGAL, which I already presented in a previous post. That previous post was about the reconstruction of surfaces. Now I have implemented the Boolean operations on meshes. Here are some simple examples.
The code generating these plots is given in the RCGAL examples.
Now let's turn to a more interesting example.
The compound of five tetrahedra is provided by RCGAL.
These are five tetrahedra in a pretty configuration, each centered at the origin. You can get their meshes by typing tetrahedraCompound. This is a list with two components: a field meshes providing, for each tetrahedron, its vertices and its faces, and a field rglmeshes, similar to meshes but with meshes ready for plotting with the rgl package. Here it is:
library(RCGAL)
library(rgl)

rglmeshes <- tetrahedraCompound[["rglmeshes"]]

open3d(windowRect = c(50, 50, 562, 562), zoom = 0.75)
bg3d("#363940")
colors <- hcl.colors(5, palette = "Spectral")
invisible(lapply(
  1:5, function(i) shade3d(rglmeshes[[i]], color = colors[i])
))
I wondered for a long time what the intersection of these five tetrahedra is, but I didn't have any tool to compute it. Now I have. Let's see.
# compute the intersection ####
inter <- MeshesIntersection(
  tetrahedraCompound[["meshes"]],
  numbersType = "lazyExact",
  clean = TRUE
)

# plot ####
open3d(windowRect = c(50, 50, 562, 562), zoom = 0.75)
bg3d("#363940")

# first the five tetrahedra with transparency ####
invisible(lapply(
  rglmeshes, shade3d, color = "yellow", alpha = 0.1
))

# now the intersection ####
rglinter <- tmesh3d(
  "vertices"    = t(inter[["vertices"]]),
  "indices"     = t(inter[["faces"]]),
  "homogeneous" = FALSE
)
shade3d(rglinter, color = "gainsboro")

# and finally the edges ####
plotEdges(
  inter[["vertices"]], inter[["exteriorEdges"]],
  only = inter[["exteriorVertices"]],
  color = "darkmagenta"
)
Here is the result:
This is an icosahedron, I think.
Unfortunately, R CMD check still throws some warnings which prevent me from publishing this package on CRAN. I hope this issue will be solved.
Save the Date: 4th-5th October
This October, Jumping Rivers will be holding our in-person Shiny in Production conference! Hosted in the centre of Newcastle upon Tyne, UK, this conference will delve into the world of {shiny} and other web-based R packages.
We have an excellent line up of expert speakers for you from a wide range of industries, as well as an afternoon of workshops hosted by our very own Jumping Rivers R pros.
Whether you’re a seasoned {shiny} user who wants to network and share knowledge, someone who’s just getting started and wants to learn from the experts, or anybody in between, if you’re interested in {shiny}, this conference is for you.
For more information take a look at our conference website.
Do you require help building a Shiny app? Would you like someone to take over the maintenance burden? If so, check out our Shiny and Dash services.
For updates and revisions to this article, see the original post
The post How to perform the Kruskal-Wallis test in R? appeared first on .
How to perform the Kruskal-Wallis test in R? When there are more than two groups, the Kruskal-Wallis test by rank is a nonparametric alternative to the one-way ANOVA test.
It extends the two-sample Wilcoxon test. When the assumptions of the one-way ANOVA test are not met, this method is advised.
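As a side note of mine (not part of the original tutorial): with exactly two groups and no ties, kruskal.test() should closely agree with the normal approximation of the Wilcoxon test when the continuity correction is switched off. The two groups below are made-up numbers purely for illustration.

```r
# Illustrative only: Kruskal-Wallis with two groups vs. the Wilcoxon test
x <- c(2.9, 3.0, 2.5, 2.6, 3.2)
y <- c(3.8, 2.7, 4.0, 2.4)

kruskal.test(list(x, y))$p.value
# compare with the normal-approximation Wilcoxon p-value
wilcox.test(x, y, exact = FALSE, correct = FALSE)$p.value
```

The two p-values should essentially coincide, since the Kruskal-Wallis statistic with k = 2 groups is the square of the Wilcoxon z-statistic.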
This article will show you how to use R to compute the Kruskal-Wallis test.
We’ll use the PlantGrowth data set that comes with R. It provides the weight of plants produced under two distinct treatment conditions and a control condition.
data <- PlantGrowth
Let’s print the head of the file
head(data)
  weight group
1   4.17   ctrl
2   5.58   ctrl
3   5.18   ctrl
4   6.11   ctrl
5   4.50   ctrl
6   4.61   ctrl
The column "group" is known as a factor in R, while the different categories ("ctrl", "trt1", "trt2") are known as factor levels. The levels are listed in alphabetical order.
Display group levels
levels(data$group)
[1] "ctrl" "trt1" "trt2"
If the levels are not in the correct order automatically, reorder them as follows:
data$group <- ordered(data$group, levels = c("ctrl", "trt1", "trt2"))
Summary statistics can be calculated by groupings. You can use the dplyr package.
Type this to install the dplyr package:
install.packages("dplyr")
Compute summary statistics by groups:
library(dplyr)

group_by(data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE),
    median = median(weight, na.rm = TRUE),
    IQR = IQR(weight, na.rm = TRUE)
  )
Source: local data frame [3 x 6]
   group count  mean        sd median    IQR
  (fctr) (int) (dbl)     (dbl)  (dbl)  (dbl)
1   ctrl    10 5.032 0.5830914  5.155 0.7425
2   trt1    10 4.661 0.7936757  4.550 0.6625
3   trt2    10 5.526 0.4425733  5.435 0.4675
Read R base graphs to learn how to utilize them. For easy ggplot2-based data visualization, we'll use the ggpubr R package.
Download and install the most recent version of ggpubr.
install.packages("ggpubr")
Let’s plot weight by group and color by group
library("ggpubr")
ggboxplot(data, x = "group", y = "weight",
          color = "group", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          order = c("ctrl", "trt1", "trt2"),
          ylab = "Weight", xlab = "Treatment")
Add error bars: mean_se
library("ggpubr")
ggline(data, x = "group", y = "weight",
       add = c("mean_se", "jitter"),
       order = c("ctrl", "trt1", "trt2"),
       ylab = "Weight", xlab = "Treatment")
We want to see if the average weights of the plants in the three experimental circumstances vary significantly.
The test can be run using the kruskal.test() function as follows.
kruskal.test(weight ~ group, data = data)
Kruskal-Wallis rank sum test
data:  weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
Inference
We can conclude that there are significant differences between the treatment groups because the p-value is less than the significance level of 0.05.
Multiple pairwise comparisons between groups were conducted.
We know there is a significant difference between groups based on the Kruskal-Wallis test's results, but we don't know which pairs of groups are different.
The function pairwise.wilcox.test() can be used to calculate pairwise comparisons between group levels with different testing corrections.
pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group,
                     p.adjust.method = "BH")
Pairwise comparisons using the Wilcoxon rank sum test
data:  PlantGrowth$weight and PlantGrowth$group

     ctrl  trt1
trt1 0.199 -
trt2 0.095 0.027
P value adjustment method: BH
Only trt1 and trt2 are statistically different (p < 0.05) in the pairwise comparison.
Take the first R steps and enhance your data science toolkit with us on June 8th!
Would you like to discover the basics of R programming and get solid foundations for future learning? Even if you are an absolute beginner, the workshop "R basics – objects, functions and operations" is designed to give you an overview of the basics of the syntax, common functions and basic operations, including how to get help and use packages available on CRAN and GitHub.
On Wednesday June 8th, at 2 p.m. CEST, this 3.5-hour hands-on workshop is the first step to take; knowing the basics of the R language will help you build solid foundations for the following, more advanced steps. Our industry experts use R daily in a professional way and will introduce you to the R language from both a technical and a practical side. You will put into practice what is taught through many short exercises during the workshop.
A handson workshop to learn R by programming R.
Register at this link before 27/05 and benefit from the early bird discount. Register more people to get a discount from the second attendee.
Also remember that this workshop is part of the “Introduction to R” learning path. Register for the series at an attractive price!
If you have any questions please contact us at training@miraisolutions.com or fill in our online "Feedback form".
NOTE: the early bird discount for the workshop “Git & GitHub (for the R User)” is due to expire today at the end of the day, do not miss the chance to register for this workshop at a discounted price.
Even when filtering by the relatively sober #rstats hashtag, I find twitter to be the stream of consciousness of an undisciplined collective mind: disjoint and ephemeral. Nevertheless, on any given day some useful R resources float by, and it is frequently the case that interesting items disappear downstream before I can record them. Here are a few I did manage to fish out recently.
On May 12, @AedinCulhane announced that Bioconductor is seeking new members for its advisory board. Read about the positions, the process and make your nominations here.
@epiRhandbook celebrated the one year anniversary of the Epi R Handbook.
@Physacourses tweeted about the reactablefmtr package which allows for the creation of interactive data tables in R.
On May 10, @tobigerstenberg posted the online book of his notes along with slides for the graduate level Psych Stats class he teaches at Stanford.
On May 9, @mdsumner called out @carroll_jono’s post Where for (loop) ARt Thou?.
On May 8, @eddelbuettel announced the new r2u: R Binaries for Ubuntu repo and a demo of a full brms installation in 13 seconds.
On May 7, @robinlovelace updated Geographic data in R, a book on how to get started with R for geographic research.
@RosanaFerrero pointed to Missing-data imputation from Gelman and Hill's book and the FES 720 Introduction to R on the importance of missing values.
On May 5, @sharon000 pointed to @rlmcelreath’s Statistical Rethinking repo with links to free video lectures.
An accessible website is more than putting content online. Making a website accessible means ensuring that it can be used by as many people as possible. Accessibility standards such as the Web Content Accessibility Guidelines (WCAG) help to standardise the way in which a website can interact with assistive technologies. Allowing developers to incorporate instructions into their web applications which can be interpreted by technologies such as screen readers helps to maintain a consistent user experience for all.
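As a small illustration (my own sketch, not from the original article; the image path and label text are made up), alternative text and ARIA attributes are the kind of instructions a screen reader can pick up in a {shiny} UI:

```r
# Hypothetical sketch: accessible markup in a Shiny UI
library(shiny)

ui <- fluidPage(
  # alt text lets a screen reader describe the image to the user
  tags$img(src = "sales.png",
           alt = "Bar chart of monthly sales, peaking in June"),
  # an aria-label gives the control an accessible name
  actionButton("refresh", "Refresh", `aria-label` = "Refresh the sales chart")
)
```

Here the extra named argument to actionButton() is rendered as an HTML attribute on the button tag, which is how ARIA hints reach assistive technologies.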
Do you require help building a Shiny app? Would you like someone to take over the maintenance burden? If so, check out our Shiny and Dash services.
Data scientists often prepare web based content around data driven insight. This might be through reports created using technologies like {rmarkdown}, perhaps GUI front ends to expose model and data APIs built using {flask} or {plumber}, or applications to facilitate analyses with {shiny} or {dash}. These outputs are created with the intention of being used by others, giving capacity to users to derive meaning and value from data and statistical or mathematical models. My users might be key stakeholders and decision makers at one of our clients, or indeed the general public. Maximising the ability for my users to gain the insight that a solution provides helps to guarantee that, as a company, we are providing value and an impactful service.
Certainly, at Jumping Rivers, as we build solutions for a number of public sector organisations, consideration of accessibility criteria has become an important part of development standards.
Meeting accessibility requirements has become an increasing area of focus for many developers. The accessibility regulations came into force for public sector bodies in the UK in September 2018, expanding upon the obligations to people who have a disability under the Equality Act. In the UK alone there are almost 2 million people classed as fully or partially blind, a further 1.5 million with a learning disability and another 11 million with some degree of hearing loss. Implementing WCAG, an approved ISO standard, is an excellent way of making sure that your website is up to par.
WCAG is a technical standard primarily aimed at web developers, providing a set of testable guidelines arranged into 4 categories: content should be perceivable, operable, understandable and robust.
Web accessibility can add value beyond making sure that your content is able to be experienced by all.
The post How to perform the MANOVA test in R? appeared first on
How to perform the MANOVA test in R? When there are several response variables, a multivariate analysis of variance (MANOVA) can be used to examine them all at once.
This article explains how to use R to compute a MANOVA.
For example, we might run an experiment in which we give two groups of mice two treatments (A and B) and measure their weight and height.
The weight and height of mice are two dependent variables in this example, and our hypothesis is that the difference in treatment affects both.
This hypothesis could be tested using a multivariate analysis of variance.
MANOVA assumptions

MANOVA can be applied in the following situations:
Within groups, the dependent variables should be normally distributed. The Shapiro-Wilk test for multivariate normality can be performed with the R function mshapiro.test() [from the mvnormtest package].
This is especially important when using MANOVA, which implies multivariate normality.
Homogeneity of variances across the range of predictors.
Linearity between all pairs of dependent variables, all pairs of covariates, and all dependent variable-covariate pairs in each cell.
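As a quick sketch of the normality assumption (my own example, not from the original post; it assumes the mvnormtest package is installed), mshapiro.test() expects a matrix with variables in rows, hence the transpose:

```r
# Hypothetical check of multivariate normality for one group of the iris data
library(mvnormtest)

setosa <- iris[iris$Species == "setosa", c("Sepal.Length", "Petal.Length")]
mshapiro.test(t(as.matrix(setosa)))  # a small p-value would suggest non-normality
```

In practice this check would be repeated within each group before trusting the MANOVA result.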
We conclude that the associated impact (treatment) is significant if the global multivariate test is significant.
The next step is to figure out whether the treatment impacts only the weight, only the height, or both. To put it another way, we want to figure out which dependent variables led to the substantial global effect.
We can evaluate each dependent variable separately using oneway ANOVA (or univariate ANOVA) to address this question.
data <- iris
Using the sample_n() function from the dplyr package, the R code below displays a random sample of our data.
Install dplyr first if you don’t already have it.
install.packages("dplyr")

set.seed(123)
dplyr::sample_n(data, 10)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1           4.3         3.0          1.1         0.1     setosa
2           5.0         3.3          1.4         0.2     setosa
3           7.7         3.8          6.7         2.2  virginica
4           4.4         3.2          1.3         0.2     setosa
5           5.9         3.0          5.1         1.8  virginica
6           6.5         3.0          5.2         2.0  virginica
7           5.5         2.5          4.0         1.3 versicolor
8           5.5         2.6          4.4         1.2 versicolor
9           5.8         2.7          5.1         1.9  virginica
10          6.1         3.0          4.6         1.4 versicolor
Question: We’re curious if there’s a major difference in sepal and petal length across the various species.
The manova() function can be used as follows:
sepl <- iris$Sepal.Length
petl <- iris$Petal.Length
MANOVA test
res.man <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(res.man)
           Df Pillai approx F num Df den Df    Pr(>F)
Species     2 0.9885   71.829      4    294 < 2.2e-16 ***
Residuals 147
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Look for the differences.
summary.aov(res.man)
 Response Sepal.Length :
             Df Sum Sq Mean Sq F value    Pr(>F)
Species       2 63.212  31.606  119.26 < 2.2e-16 ***
Residuals   147 38.956   0.265
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response Petal.Length :
             Df Sum Sq Mean Sq F value    Pr(>F)
Species       2 437.10 218.551  1180.2 < 2.2e-16 ***
Residuals   147  27.22   0.185
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The two variables are extremely significantly different among Species, as shown in the result above.
Hello everyone! I am psyched to announce the launch of my free workshop about how to learn R. It’s been a long time in the making, but it is finally here. The workshop is called The Myth of the R Learning Curve (or how not to go crazy when learning R).
In the workshop, I go over my own personal story and how I came to love and learn R. I also talk about why R is such a powerful tool that brings you to the cutting edge of science. Some of the other key topics I cover in the workshop include:
I feel strongly about the fact that it doesn’t need to take years and expensive university courses to feel comfortable working with R. I have taught hundreds of students and I am excited to share what I’ve learned along the way. Don’t let the R learning curve stand in the way of doing good science. Learning R can be faster, more fun, and easier than you thought.
If you watch it to the very end, I’ll be sharing a cool bonus surprise so that you never have another reason to say you don’t know R.
To watch my free workshop, just click here to enter your email and access the workshop.
I look forward to seeing you there!
~ Luka
A rather curious question on X validated about the evolution of
when $M$ increases. Actually, this expectation is asymptotically equivalent to
or again
where the average is made of Pareto(1,1) variates, since one can invoke Slutsky many times. (And the above comparison of the integrated rv's does not show a major difference.) Comparing several Monte Carlo sequences shows a lot of variability, though, which is not surprising given the lack of expectation of the Pareto(1,1) distribution. But over the time I spent on that puzzle last weekend, I could not figure out the limiting value, despite uncovering the asymptotic behaviour of the average.
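That variability can be illustrated with a quick simulation of my own (not from the original post): since a Pareto(1,1) variate can be generated as 1/U with U uniform on (0,1), and its mean is infinite, running averages wander wildly from one Monte Carlo run to the next.

```r
# Sketch: running averages of Pareto(1,1) samples across several runs
# (X = 1/U with U ~ Uniform(0,1) is Pareto(1,1); E[X] does not exist)
set.seed(1)
M <- 1e5
avg <- sapply(1:5, function(i) cumsum(1 / runif(M)) / (1:M))
matplot(avg, type = "l", lty = 1, log = "y",
        xlab = "M", ylab = "running average")
```

Each of the five trajectories keeps jumping whenever a huge draw arrives, which is consistent with the lack of a finite expectation.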