(This article was first published on ** Learn R Programming & Build a Data Science Career | Michael Toth**, and kindly contributed to R-bloggers)

When it comes to data visualization, flashy graphs can be fun. But if you’re trying to convey information, especially to a broad audience, flashy isn’t always the way to go.

Last week I showed how to work with line graphs in R.

In this article, I’m going to talk about creating a scatter plot in R. Specifically, we’ll be creating a `ggplot` scatter plot using `ggplot`’s `geom_point` function.

A scatter plot is a two-dimensional data visualization that uses points to graph the values of two different variables – one along the x-axis and the other along the y-axis. Scatter plots are often used when you want to assess the relationship (or lack of relationship) between the two variables being plotted.

For example, in this graph, FiveThirtyEight uses Rotten Tomatoes ratings and Box Office gross for a series of Adam Sandler movies to create this scatter plot. They’ve additionally grouped the movies into 3 categories, highlighted in different colors.

This scatter plot, initially created by Hans Rosling, is famous among data visualization practitioners. It graphs the life expectancy vs. income for countries around the world. It also uses the size of the points to map country population and the color of the points to map continents, adding 2 additional variables to the traditional scatter plot.

Hans Rosling used a famously provocative and animated presentation style to make this data come alive. He used his presentations to advocate for sustainable global development through the Gapminder Foundation.

Hans Rosling’s example shows how simple graphic styles can be powerful tools for communication and change when used properly! Convinced? Let’s dive into this guide to creating a ggplot scatter plot in R!

I’ve created a free workbook to help you apply what you’re learning as you read.

The workbook is an R file that contains all the code shown in this post, as well as additional questions and exercises to help you understand the topic even more deeply.

If you want to really learn how to create a scatter plot in R so that you’ll still remember weeks or even months from now, you need to practice.

So download the workbook now and practice as you read this post!

Before we get into the ggplot code to create a scatter plot in R, I want to briefly touch on `ggplot` and why I think it’s the best choice for plotting graphs in R.

`ggplot` is a package for creating graphs in R, but it’s also a method of thinking about and decomposing complex graphs into logical subunits.

`ggplot` takes each component of a graph (axes, scales, colors, objects, etc.) and allows you to build graphs up sequentially, one component at a time. You can then modify each of those components in a way that’s both flexible and user-friendly. When components are unspecified, `ggplot` uses sensible defaults. This makes `ggplot` a powerful and flexible tool for creating all kinds of graphs in R. It’s the tool I use to create nearly every graph I make these days, and I think you should use it too!

Throughout this post, we’ll be using the `mtcars` dataset that’s built into R. This dataset contains details of design and performance for 32 cars. Let’s take a look to see what it looks like:
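A quick way to take that look is with `head()` and `str()` from base R:

```
# peek at the first rows and the structure of the built-in dataset
head(mtcars)
str(mtcars)
```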

The mtcars dataset contains 11 columns:

- `mpg`: Miles/(US) gallon
- `cyl`: Number of cylinders
- `disp`: Displacement (cu.in.)
- `hp`: Gross horsepower
- `drat`: Rear axle ratio
- `wt`: Weight (1000 lbs)
- `qsec`: 1/4 mile time
- `vs`: Engine (0 = V-shaped, 1 = straight)
- `am`: Transmission (0 = automatic, 1 = manual)
- `gear`: Number of forward gears
- `carb`: Number of carburetors

`ggplot` uses geoms, or geometric objects, to form the basis of different types of graphs. Previously I talked about `geom_line`, which is used to produce line graphs. Today I’ll be focusing on `geom_point`, which is used to create scatter plots in R.

```
library(tidyverse)
ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg))
```

Here we are starting with the simplest possible `ggplot` scatter plot we can create using `geom_point`. Let’s review this in more detail:

First, I call `ggplot`, which creates a new `ggplot` graph. It’s essentially a blank canvas on which we’ll add our data and graphics. In this case, I passed `mtcars` to `ggplot` to indicate that we’ll be using the mtcars data for this particular `ggplot` scatter plot.

Next, I added my `geom_point` call to the base `ggplot` graph in order to create this scatter plot. In `ggplot`, you use the `+` symbol to add new layers to an existing graph. In this second layer, I told `ggplot` to use `wt` as the x-axis variable and `mpg` as the y-axis variable.

And that’s it, we have our scatter plot! It shows that, on average, as the weight of cars increases, miles-per-gallon tends to fall.

Expanding on this example, we can now play with colors in our scatter plot.

```
ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg), color = 'blue')
```

You’ll note that this `geom_point` call is identical to the one before, except that we’ve added the modifier `color = 'blue'` to the end of the line. Experiment a bit with different colors to see how this works on your machine. You can use most color names you can think of, or you can use specific hex color codes to get more granular.
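For instance, here is the same plot using a hex code instead of a color name (the particular hex value is just an arbitrary blue chosen for illustration):

```
library(ggplot2)

# '#4682B4' is the hex code for the named color steelblue
ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg), color = '#4682B4')
```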

Now, let’s try something a little different. Compare the `ggplot` code below to the code we just executed above. There are 3 differences. See if you can find them and guess what will happen, then scroll down to take a look at the result. If you’ve read my previous `ggplot` guides, this bit should look familiar!

```
mtcars$am <- factor(mtcars$am)

ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, color = am))
```

This graph shows the same data as before, but now there are two different colors! The red dots correspond to automatic transmission vehicles, while the blue dots represent manual transmission vehicles. Did you catch the 3 changes we used to change the graph? They were:

- First, we converted the `am` variable to a factor. What do you think happens if we don’t do this? Give it a try!
- Instead of specifying `color = 'blue'`, we specified `color = am`
- We moved the color parameter inside of the `aes()` parentheses

Let’s review each of these changes:

**Converting the `am` variable to a factor**

In the dataset, `am` was initially a numeric variable. You can check this by running `class(mtcars$am)`. When you pass a numeric variable to a color scale in `ggplot`, it creates a continuous color scale.

In this case, however, there are only 2 values for the `am` field, corresponding to automatic and manual transmission. So it makes our graph more clear to use a discrete color scale, with 2 color options for the two values of `am`. We can accomplish this by converting the `am` field from a numeric value to a factor, as we did above.

On your own, try graphing both with and without this conversion to factor. If you’ve already converted to factor, you can reload the dataset by running `data(mtcars)` to try graphing as numeric!
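A minimal before-and-after check of the variable’s class, using only base R:

```
data(mtcars)       # reload the fresh dataset
class(mtcars$am)   # "numeric": ggplot would build a continuous color scale

mtcars$am <- factor(mtcars$am)
class(mtcars$am)   # "factor": ggplot now builds a discrete color scale
```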

This point is a bit tricky. Check out my workbook for this post for a guided exploration of this issue in more detail!

**Specifying `color = am` and moving it within the `aes()` parentheses**

I’m combining these because these two changes work together.

Before, we told `ggplot` to change the color of the points to blue by adding `color = 'blue'` to our `geom_point()` call.

What we’re doing here is a bit more complex. Instead of specifying a single color for our points, we’re telling `ggplot` to *map* the data in the `am` column to the `color` aesthetic.

This means we are telling `ggplot` to use a different color for each value of `am` in our data! This mapping also lets `ggplot` know that it needs to create a legend to identify the transmission types, and it places that legend there automatically!

Let’s look at a related example. This time, instead of changing the color of the points in our scatter plot, we will change the shape of the points:

```
mtcars$am <- factor(mtcars$am)

ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, shape = am))
```

The code for this `ggplot` scatter plot is identical to the code we just reviewed, except we’ve substituted `shape` for `color`. The graph produced is quite similar, but it uses different shapes (triangles and circles) instead of different colors. You might consider using something like this when printing in black and white, for example.

**`aes()` (aesthetic) mappings in ggplot**

We just saw how we can create graphs in `ggplot` that map the `am` variable to color or shape in a scatter plot. `ggplot` refers to these mappings as *aesthetic* mappings, and they include everything you see within the `aes()` in ggplot.

Aesthetic mappings are a way of mapping *variables in your data* to particular *visual properties* (aesthetics) of a graph.

I know this can sound a bit theoretical, so let’s review the specific aesthetic mappings you’ve already seen as well as the other mappings available within geom_point.

The main aesthetic mappings for a ggplot scatter plot include:

- `x`: Map a variable to a position on the x-axis
- `y`: Map a variable to a position on the y-axis
- `color`: Map a variable to a point color
- `shape`: Map a variable to a point shape
- `size`: Map a variable to a point size
- `alpha`: Map a variable to a point transparency

From the list above, we’ve already seen the `x`, `y`, `color`, and `shape` aesthetic mappings.

`x` and `y` are what we used in our first `ggplot` scatter plot example, where we mapped the variables `wt` and `mpg` to x-axis and y-axis values. Then, we experimented with using `color` and `shape` to map the `am` variable to different colored points or shapes.

In addition to those, there are 2 other aesthetic mappings commonly used with `geom_point`. We can use the `alpha` aesthetic to change the transparency of the points in our graph. Finally, the `size` aesthetic can be used to change the size of the points in our scatter plot.

Note there are two additional aesthetic mappings for ggplot scatter plots, `stroke` and `fill`, but I’m not going to cover them here. They’re only used for particular shapes, and have very specific use cases beyond the scope of this guide.

**Using the `size` aesthetic mapping in a ggplot scatter plot**

```
ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, size = cyl))
```

In the code above, we map the number of cylinders (`cyl`) to the size aesthetic in ggplot. Cars with more cylinders display as larger points in this graph.

Note: A scatter plot where the size of the points vary based on a variable in the data is sometimes called a bubble chart. The scatter plot above could be considered a bubble chart!

In general, we see that cars with more cylinders tend to be clustered in the bottom right of the graph, with larger weights and lower miles per gallon, while those with fewer cylinders are on the top left. That said, it’s a bit hard to make out all the points in the bottom right corner. How can we solve that issue? Let’s learn more about the alpha aesthetic to find out!

**Using the `alpha` aesthetic**

```
ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, alpha = cyl))
```

In this code we’ve mapped the alpha aesthetic to the variable `cyl`. Cars with fewer cylinders appear more transparent, while those with more cylinders are more opaque. But in this case, I don’t think this helps us to understand relationships in the data any better. Instead, it just seems to highlight the points on the bottom right. I think this is a bad graph!

How else can we use the alpha aesthetic to improve the readability of our graph? Let’s turn back to our code from above where we mapped the cylinders to the size variable, creating what I called a bubble chart. Remember how it was difficult to make out all of the cars in the bottom right? What if we made all of the points in the graph semi-transparent so that we can see through the bubbles that are overlapping? Let’s try!

```
ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, size = cyl), alpha = 0.3)
```

This makes it much easier to see the clustering of larger cars in the bottom right while not reducing the importance of those points in the top left! This is my favorite use of the alpha aesthetic in ggplot: adding transparency to get more insight into dense regions of points.

Above, we saw that we are able to use `color` in two different ways with `geom_point`. First, we were able to set the color of our points to blue by specifying `color = 'blue'` *outside* of our `aes()` mappings. Then, we were able to *map* the variable `am` to color by specifying `color = am` *inside* of our `aes()` mappings.

Similarly, we saw two different ways to use the `alpha` aesthetic as well. First, we *mapped* the variable `cyl` to alpha by specifying `alpha = cyl` *inside* of our `aes()` mappings. Then, we set the alpha of all points to 0.3 by specifying `alpha = 0.3` *outside* of our `aes()` mappings.

What is the difference between these two ways of dealing with the aesthetic mappings available to us?

Each of the aesthetic mappings you’ve seen can also be used as a *parameter*, that is, a fixed value defined outside of the `aes()` aesthetic mappings. You saw how to do this with color when we made the scatter plot points blue with `color = 'blue'` above. Then, you saw how to do this with alpha when we set the transparency to 0.3 with `alpha = 0.3`. Now let’s look at an example of how to do this with shape in the same manner:

```
ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg), shape = 18)
```

Here, we specify shape 18, which corresponds to the diamond shape you see here. Because we specified this *outside* of the `aes()`, it applies to all of the points in this graph!
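If you’d like to see all of the built-in point shapes and their numeric codes at a glance, here is a small chart sketch of my own (not from the original post):

```
library(ggplot2)

# lay out shapes 0-25 in a grid, labelled with their codes
shapes <- data.frame(code = 0:25)
ggplot(shapes, aes(x = code %% 13, y = -(code %/% 13), shape = code)) +
  geom_point(size = 4) +
  geom_text(aes(label = code), nudge_y = 0.35, size = 3) +
  scale_shape_identity() +  # use the codes directly as shapes
  theme_void()
```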

To review what values `shape`, `size`, and `alpha` accept, just run `?shape`, `?size`, or `?alpha` from your console window! For even more details, check out `vignette("ggplot2-specs")`.

When I was first learning R and ggplot, the difference between aesthetic mappings (the values included *inside* your `aes()`) and parameters (the ones *outside* your `aes()`) constantly confused me. Luckily, over time, you’ll find that this becomes second nature. But in the meantime, I can help you speed along this process with a few common errors that you can keep an eye out for.

**Including aesthetic mappings outside your `aes()` call**

If you’re trying to map the `cyl` variable to `shape`, you should include `shape = cyl` within the `aes()` of your `geom_point` call. What happens if you accidentally include it outside instead, and run `ggplot(mtcars) + geom_point(aes(x = wt, y = mpg), shape = cyl)`? You’ll get an error message that looks like this:
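You can reproduce that error yourself by wrapping the failing call in `try()` (a sketch of mine; the exact wording varies across ggplot2 versions, but it always complains that the object was not found):

```
library(ggplot2)

# shape = cyl sits OUTSIDE aes(), so R looks for a free-standing
# object named `cyl` instead of the mtcars column, and fails
res <- try(ggplot(mtcars) +
             geom_point(aes(x = wt, y = mpg), shape = cyl),
           silent = TRUE)
inherits(res, "try-error")  # TRUE; the message mentions object 'cyl' not found
```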

Whenever you see this error about an object not being found, be sure to check that you’re including your aesthetic mappings *inside* the `aes()` call!

**Including parameters inside your `aes()` call**

On the other hand, if we try including a specific parameter value (for example, `color = 'blue'`) inside of the `aes()` mapping, the error is a bit less obvious. Take a look:

```
ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, color = 'blue'))
```

In this case, `ggplot` actually does produce a scatter plot, but it’s not what we intended.

For starters, the points are all red instead of the blue we were hoping for! Also, there’s a legend in the graph that simply says ‘blue’.

What’s going on here? Under the hood, `ggplot` has taken the string ‘blue’ and effectively created a new hidden column of data where every value simply says ‘blue’. Then, it has *mapped* that column to the color aesthetic, like we saw before when we specified `color = am`. This results in the legend label and the color of all the points being set, not to blue, but to the default color in `ggplot`.

If this is confusing, that’s okay for now. Just remember: when you run into issues like this, double check to make sure you’re including the parameters of your graph *outside* your `aes()` call!

You should now have a solid understanding of how to create a scatter plot in R using the `ggplot` scatter plot function, `geom_point`!

Experiment with the things you’ve learned to solidify your understanding. You can download my free workbook with the code from this article to work through on your own.

I’ve found that working through code on my own is the best way for me to learn new topics so that I’ll actually remember them when I need to do things on my own in the future.

Download the workbook now to practice what you learned!

To **leave a comment** for the author, please follow the link and comment on their blog: ** Learn R Programming & Build a Data Science Career | Michael Toth**.


(This article was first published on ** R-Bloggers – Learning Machines**, and kindly contributed to R-bloggers)

Like most people you will have used a search engine lately, like *Google*. But have you ever thought about how it manages to give you the most fitting results? How does it order the results so that the best are on top? Read on to find out!

The earliest search engines either had human-curated indices, like *Yahoo!*, or used some simple heuristic, like *Altavista* did: the more often the keyword you were looking for was mentioned on a page, the better. This led to all kinds of crazy effects, like certain keywords being repeated thousands of times on webpages to make them more “relevant”.

Now, most of those search engines are long gone because a new kid arrived on the block: Google! Google’s search engine results were much better than all of the competition and they became the dominant player in no time. How did they do that?

The big idea was in fact published by the two founders: Sergey Brin and Lawrence Page, it is called the *pagerank algorithm* (which is of course a pun because one of the authors was named Page too). The original paper can be found here: S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine.

Let us start with another, related question: which properties are the best to own in Monopoly? Many would instinctively answer with the most expensive ones, i.e. Park Place and Boardwalk. But a second thought reveals that those might be the ones where you get the biggest rent if somebody lands on them, yet the last part is the caveat… “IF” somebody lands on them! The best streets are actually the ones players land on the most. Those happen to be the orange streets, St. James Place, Tennessee Avenue and New York Avenue, and therefore they are the key to winning the game.

How do you find those properties? For example by simulation: you just simulate thousands of dice rolls and see where the players land.

A similar idea holds true for finding the best web pages: you just start from a random position and simulate a surfer who visits different web pages by chance. For each surfing session you tally the respective webpage where she ends up and after many runs we get a percentage for each page. The higher this percentage is the more relevant the webpage!

Let us do this with some R code. First we define a very small net and plot it (the actual example can be found in chapter 30 of the very good book “Chaotic Fishponds and Mirror Universes” by Richard Elwes):

```
library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union

# cols represent outgoing links, rows incoming links
# A links to C, D; B links to A; C links to A; D links to A, B, C
M <- matrix(c(0, 0, 1, 1,
              1, 0, 0, 0,
              1, 0, 0, 0,
              1, 1, 1, 0), nrow = 4)
colnames(M) <- rownames(M) <- c("A", "B", "C", "D")
M
##   A B C D
## A 0 1 1 1
## B 0 0 0 1
## C 1 0 0 1
## D 1 0 0 0

# careful with how the adjacency matrix is defined -> transpose of matrix
g <- graph_from_adjacency_matrix(t(M))
plot(g)
```

Now, we are running the actual simulation. We define two helper functions for that, `next_page`

for getting a random but possible next page given the page our surfer is on at the moment and `last_page`

which gives the final page after `N`

clicks:

```
next_page <- function(page, graph) {
  l <- sample(rownames(graph)[as.logical(graph[ , as.logical(page)])], 1)
  as.numeric(rownames(graph) == l)
}

last_page <- function(page, graph, N = 100) {
  for (i in seq(N)) {
    page <- next_page(page, graph)
  }
  page
}

current_page <- c(1, 0, 0, 0) # random surfer starting from A
random_surfer <- replicate(2e4, last_page(current_page, M, 50))
round(rowSums(random_surfer) / sum(random_surfer), 2)
## [1] 0.43 0.07 0.28 0.22
```

So we see that page A is the most relevant one because our surfer ends up being there in more than 40% of all sessions, after that come the pages C, D and B. When you look at the net that makes sense, since all pages refer to A whereas B gets only one link, so it doesn’t seem to be that relevant.

As you have seen the simulation even for this small net took quite long so we need some clever mathematics to speed up the process. One idea is to transform our matrix which represents the network into a matrix which gives the probabilities of landing on the next pages and then multiply the probability matrix with the current position (and thereby transform the probabilities). Let us do this for the first step:

```
M_prob <- prop.table(M, 2) # create probability matrix
M_prob
##     A B C         D
## A 0.0 1 1 0.3333333
## B 0.0 0 0 0.3333333
## C 0.5 0 0 0.3333333
## D 0.5 0 0 0.0000000

M_prob %*% current_page
##   [,1]
## A  0.0
## B  0.0
## C  0.5
## D  0.5
```

The result says that there is a fifty-fifty chance of landing on C or D. When you look at the graph you see that this is correct since there are two links, one to C and one to D! For the next step you would have to multiply the matrix with the result again, or first multiply the matrix with itself before multiplying with the current position, which gives:
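Written out, the second step is `M_prob %*% M_prob %*% current_page`. As a quick numeric check of my own (this snippet is not in the original post):

```
# rebuild the small example net
M <- matrix(c(0, 0, 1, 1,
              1, 0, 0, 0,
              1, 0, 0, 0,
              1, 1, 1, 0), nrow = 4,
            dimnames = list(c("A", "B", "C", "D"), c("A", "B", "C", "D")))
M_prob <- prop.table(M, 2)
current_page <- c(1, 0, 0, 0)

# squaring the matrix first gives the same two-step probabilities
M_prob %*% M_prob %*% current_page  # A: 2/3, B: 1/6, C: 1/6, D: 0
```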

If we want to do this a hundred times we just have to raise this probability matrix to the one hundredth power. We use the `%^%` operator from the `expm` package (on CRAN) for that:

```
library(expm)
## Loading required package: Matrix
## 
## Attaching package: 'expm'
## The following object is masked from 'package:Matrix':
## 
##     expm

r <- M_prob %^% 100 %*% current_page
r
##         [,1]
## A 0.42857143
## B 0.07142857
## C 0.28571429
## D 0.21428571
```

Again, we get the same result! You might ask: why raise it to the 100th power? The answer is that this is in most cases enough to get a stable result, so that any further multiplication still gives the same result:
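In other words, at that point `M_prob %*% r` equals `r`. A quick numeric check of my own (it needs the `expm` package loaded as above):

```
library(expm)

M <- matrix(c(0, 0, 1, 1,
              1, 0, 0, 0,
              1, 0, 0, 0,
              1, 1, 1, 0), nrow = 4,
            dimnames = list(c("A", "B", "C", "D"), c("A", "B", "C", "D")))
M_prob <- prop.table(M, 2)
current_page <- c(1, 0, 0, 0)

r <- M_prob %^% 100 %*% current_page
# one more multiplication leaves the stable vector unchanged
all.equal(as.numeric(M_prob %*% r), as.numeric(r))  # TRUE
```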

The last equation opens up still another possibility: we are obviously looking for a vector `r` which goes unaffected when multiplied by the matrix `M_prob`. There is a mathematical name for that kind of behaviour: *eigenvector*! As you might have guessed, the name is an import from the German language, where it means something like “own vector”.

This hints at the problem we were solving all along (without consciously realizing it, perhaps): a page is the more relevant the more relevant the pages are that link to it… but to know the importance of those pages we need the importance of the pages linking to *them*, and so on and so forth; we are going in circles here. The same is true when you look at the equation above: you define `r` in terms of `r` itself, because `r` is the eigenvector of matrix `M_prob`!

There are very fast and powerful methods to find the eigenvectors of a matrix, and the corresponding `eigen` function is available in base R:

```
lr <- Re(eigen(M_prob)$vectors[ , 1]) # real parts of biggest eigenvector
lr / sum(lr) # normalization
## [1] 0.42857143 0.07142857 0.28571429 0.21428571
```

Again, the same result! You can now understand the title of this post and titles of other articles about the pagerank algorithm and Google like “The $25,000,000,000 eigenvector”.

Yet, a word of warning is in order: there are cases where the probability matrix is not diagonalizable (we won’t get into the mathematical details here), which means that the eigenvector method won’t give sensible results. To check this, the following code must evaluate to `TRUE`:

```
ev <- eigen(M_prob)$values
length(unique(ev)) == length(ev)
## [1] TRUE
```

We now repeat the last two methods for a bigger network:

```
set.seed(1415)
n <- 10
g <- sample_gnp(n, p = 1/4, directed = TRUE) # create random graph
g <- set_vertex_attr(g, "name", value = LETTERS[1:n])
plot(g)
```

```
M <- t(as_adjacency_matrix(g, sparse = FALSE))
M_prob <- prop.table(M, 2) # create probability matrix
M_prob
##      A B C D   E   F   G         H         I   J
## A 0.00 0 0 1 0.5 0.5 0.5 0.0000000 0.0000000 0.5
## B 0.00 0 0 0 0.0 0.0 0.0 0.3333333 0.0000000 0.0
## C 0.00 1 0 0 0.0 0.0 0.0 0.0000000 0.3333333 0.5
## D 0.25 0 0 0 0.0 0.0 0.0 0.0000000 0.0000000 0.0
## E 0.25 0 0 0 0.0 0.0 0.5 0.3333333 0.3333333 0.0
## F 0.00 0 1 0 0.0 0.0 0.0 0.0000000 0.3333333 0.0
## G 0.25 0 0 0 0.0 0.0 0.0 0.0000000 0.0000000 0.0
## H 0.00 0 0 0 0.5 0.0 0.0 0.0000000 0.0000000 0.0
## I 0.00 0 0 0 0.0 0.5 0.0 0.0000000 0.0000000 0.0
## J 0.25 0 0 0 0.0 0.0 0.0 0.3333333 0.0000000 0.0

current_page <- c(1, rep(0, n-1))
r <- M_prob %^% 100 %*% current_page
r
##         [,1]
## A 0.27663574
## B 0.02429905
## C 0.08878509
## D 0.06915881
## E 0.14579434
## F 0.10654199
## G 0.06915881
## H 0.07289723
## I 0.05327107
## J 0.09345787

lr <- Re(eigen(M_prob)$vectors[ , 1])
lr / sum(lr) # normalization of the real parts
##  [1] 0.27663551 0.02429907 0.08878505 0.06915888 0.14579439 0.10654206
##  [7] 0.06915888 0.07289720 0.05327103 0.09345794
```

We can now order the pages according to their importance – like the first 10 results of a google search:

```
search <- data.frame(Page = LETTERS[1:n], Rank = r)
search[order(search$Rank, decreasing = TRUE), ]
##   Page       Rank
## A    A 0.27663574
## E    E 0.14579434
## F    F 0.10654199
## J    J 0.09345787
## C    C 0.08878509
## H    H 0.07289723
## D    D 0.06915881
## G    G 0.06915881
## I    I 0.05327107
## B    B 0.02429905
```

Looking at the net, does the resulting order make sense to you?

Congratulations, you now understand the big idea behind one of the greatest revolutions in information technology!

To **leave a comment** for the author, please follow the link and comment on their blog: ** R-Bloggers – Learning Machines**.


(This article was first published on ** R on easystats**, and kindly contributed to R-bloggers)

Many times, for instance when teaching, I needed to quickly and simply generate a **perfectly normally distributed sample** to illustrate or show some of its characteristics.

This is now very easy to do with the new `bayestestR` package, which includes the `rnorm_perfect` function. This function is very similar to the classic `rnorm` (same arguments), with the difference that the generated sample is *perfectly* normal.

`bayestestR` can be installed as follows:

```
install.packages("bayestestR") # Install the package
library(bayestestR) # Load it
```

```
# Generate a perfect sample
x <- rnorm_perfect(n = 100, mean = 0, sd = 1)

# Visualise it
library(tidyverse)
x %>%
  density() %>% # Compute density function
  as.data.frame() %>%
  ggplot(aes(x = x, y = y)) +
  geom_line()
```

We can also easily color some of the parts of the curve, for instance, the observations lying beyond +2 standard deviations.

```
x %>%
  density() %>% # Compute density function
  as.data.frame() %>%
  mutate(outlier = ifelse(x > 2, "Extreme", "Not extreme")) %>%
  ggplot(aes(x = x, y = y, fill = outlier)) +
  geom_ribbon(aes(ymin = 0, ymax = y)) +
  theme_classic()
```

More details about `bayestestR`’s features are coming soon, stay tuned!

**Don’t forget to check out the documentation here for more!**

Feel free to let us know how we could further improve this package! Also, note that *easystats*, the project supporting `bayestestR`, is in active development. Thus, do not hesitate to contact us if **you want to get involved**!

**Check out our other blog posts** *here*!

To **leave a comment** for the author, please follow the link and comment on their blog: ** R on easystats**.


(This article was first published on ** rOpenSci - open tools for open science**, and kindly contributed to R-bloggers)

Last month we released a new version of pdftools and a new companion package qpdf for working with pdf files in R. This release introduces the ability to perform pdf transformations, such as splitting and combining pages from multiple files. Moreover, the `pdf_data()` function, which was introduced in pdftools 2.0, is now available on all major systems.

It is now possible to split, join, and compress pdf files with pdftools. For example, the `pdf_subset()` function creates a new pdf file with a selection of the pages from the input file:

```
# Load pdftools
library(pdftools)

# extract some pages
pdf_subset('https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf',
           pages = 1:3, output = "subset.pdf")

# Should say 3
pdf_length("subset.pdf")
```

Similarly, `pdf_combine()` is used to join several pdf files into one.

```
# Generate another pdf
pdf("test.pdf")
plot(mtcars)
dev.off()
# Combine them with the other one
pdf_combine(c("test.pdf", "subset.pdf"), output = "joined.pdf")
# Should say 4
pdf_length("joined.pdf")
```

The split and join features are provided via a new package qpdf which wraps the qpdf C++ library. The main benefit of qpdf is that no external software (such as pdftk) is needed. The qpdf package is entirely self contained and works reliably on all operating systems with zero system dependencies.
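Compression, the third operation mentioned earlier, works analogously; here is a minimal sketch of mine using `pdf_compress()`, which qpdf provides and pdftools re-exports:

```
library(pdftools)

# create a throwaway pdf to work with
pdf("scatter.pdf")
plot(mtcars$wt, mtcars$mpg)
dev.off()

# rewrite the file through qpdf; the page count stays the same
pdf_compress("scatter.pdf", output = "compressed.pdf")
pdf_length("compressed.pdf") # Should say 1
```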

The pdftools 2.0 announcement post from December introduced the new `pdf_data()` function for extracting individual text boxes from pdf files. However, it was noted that this function was not yet available on most Linux distributions, because it requires a recent fix from poppler 0.73.

I am happy to say that this should soon work on all major Linux distributions. Ubuntu has upgraded to poppler 0.74 in Ubuntu Disco, which was released this week. I also created a PPA for Ubuntu 16.04 (Xenial) and 18.04 (Bionic) with backports of poppler 0.74. This makes it possible to use `pdf_data` on Ubuntu LTS servers, including Travis:

```
sudo add-apt-repository ppa:opencpu/poppler
sudo apt-get update
sudo apt-get install libpoppler-cpp-dev
```

Moreover, the upcoming Fedora 30 will ship with poppler-devel 0.73.

Finally, the upcoming Debian “Buster” release will ship with poppler 0.71, but the Debian maintainers were nice enough to let me backport the required patch from poppler 0.73, so `pdf_data()` will work on Debian (and hence CRAN) as well!

To **leave a comment** for the author, please follow the link and comment on their blog: ** rOpenSci - open tools for open science**.


(This article was first published on ** R by R(yo)**, and kindly contributed to R-bloggers)

As the sakura bloomed in Tokyo, another TokyoR User Meetup was held, this time at SONY City! On April 13th useRs from all over Tokyo (and some even from further afield) flocked to Osaki, Tokyo for a special session focused on beginner R users, **BeginneRs**. We’ve also had several special “themed” sessions in the past, like TokyoR #69: Bayesian statistics in June last year as well as the TokyoR #70: Hadley Wickham + Joe Rickert Special! last July. Although mainly for BeginneRs, the LTs for this session were applications of R/case studies and may be of more interest to non-beginners, if you want to skip ahead to that section.

Like my previous round up posts (for TokyoR #76 and JapanR Conference #9) I will be going over around half of all the talks. Hopefully, my efforts will help spread the vast knowledge of Japanese R users to the wider R community. Throughout I will also post helpful blog posts and links from other sources if you are interested in learning more about the topic of a certain talk. You can follow **Tokyo.R** by searching for the #TokyoR hashtag on Twitter.

These were the same set of beginner user focused talks that happens at the start of every TokyoR meetup:

- Basics of R Programming by kotatyamtema
- Data munging with the tidyverse by y_mattu
- Visualization with R by koriakane

Since this was a special **BeginneR** session, the main talks were all focused on even more introductory material on top of the usual beginner’s session in the previous section.

First up, `@kilometer00` talked about doing data analysis with R. As a brief introduction he talked about what exactly IS data analysis, how we might go about forming a research problem and hypothesis, and how these steps are important in figuring out what kind of modeling we might attempt to do on the data at hand. The modeling techniques `@kilometer00` covered were just the basics, such as single/multiple linear regression and PCA.

In the last part of the talk he covered nested data modelling. By using `group_by()` and `nest()` we can create a data frame that has one row per group (for example, a row for **each** species group in the iris data set) with a new column called `data` which is itself a data frame. Now, instead of having to repeat the modelling code over and over again for each subset of rows (a model for **each** species in the iris data set), by using the `purrr::map_*()` family of functions you can apply your model to each of the groups that you specified earlier. Filled with great explanatory graphics, plots, and code, the slides are a good example of teaching the basics of modelling with R.
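The nest-and-map pattern he described can be sketched on the iris data set (a minimal illustration, assuming the dplyr, tidyr, and purrr packages are available; the model formula is my own example, not from the talk):

```r
library(dplyr)
library(tidyr)
library(purrr)

# One row per species, with each group's rows tucked into a list-column "data"
by_species <- iris %>%
  group_by(Species) %>%
  nest()

# Fit the same linear model to each group's data frame
by_species <- by_species %>%
  mutate(model = map(data, ~ lm(Petal.Length ~ Sepal.Length, data = .x)))

by_species$model[[1]]  # the model fitted to the first species
```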

Some other resources about modelling with R:

- Model building chapter in R for Data Science – Hadley Wickham
- Picking the best model with caret – Rachael Tatman
- Introduction to Statistical Learning – Gareth James, Daniela Witten, Trevor Hastie, & Rob Tibshirani
- Applied Predictive Modeling – Max Kuhn & Kjell Johnson

Another TokyoR organizer, `@aad34210`, talked about using the RStudio IDE to maximize R’s strengths for programming and data analysis. After a brief spiel on the early days of using R solely from the console, he talked about RStudio’s capabilities and the various options that can be customized to suit your R needs. From installing RStudio, configuring the four main panes, using R Projects, and using `Global Options`, `aad34210` opened up his own RStudio window and pointed out the various menu options in thorough detail to help beginneRs navigate without getting overwhelmed. He rounded off the talk by showing the various cheat sheets (including one for RStudio itself) that can be obtained from the `Help` tab.

Some other resources one might consider to learn RStudio are:

- Introduction to RStudio – Software Carpentry
- RStudio Essentials – RStudio
- RStudio IDE Cheat Sheet (shown above) – RStudio

`@u_ribo` gave a talk about the benefits of creating a reproducible and enclosed R environment using git and Docker. As an instructor who has run many R workshops, `@u_ribo` has run into the problem of his learners all having different OSs, versions of packages, and versions of R itself, causing much headache for all involved. This is also the case when working in a team environment where analyses need to be shared and reproducibility of results is essential.

To reduce these problems he gave a live demo using a variety of R tools such as the `here` package, the `usethis` package, and managing a project environment with RStudio Projects (`.Rproj`). To go further in depth he also talked about using git (made very easy by its seamless integration with RStudio) and the use of Docker. To run Docker you need a Docker “image” and a Docker “container”. An image is defined by a file, called a `Dockerfile`, that has the information about and configures the operating system for the environment. The container is the actual running instance of the “image” that is defined by the Dockerfile. Going back to the issue of running a workshop, using Docker allows all of the participants to run R inside a specific container, an enclosed environment set up by the instructor, so that all the different dependencies and version differences won’t prevent you from running the code provided in the workshop.
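A workshop image of the kind he described can be defined in a few lines, for example by building on the rocker project’s pre-built R images (a sketch; the base image tag and package list are illustrative, not from the talk):

```dockerfile
# Pin a specific R version so every participant gets the same environment
FROM rocker/tidyverse:3.5.3

# Install the extra packages the workshop materials depend on
RUN R -e "install.packages(c('here', 'usethis'))"

# Copy the workshop project into the image
COPY . /home/rstudio/workshop
```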

Other good introductions to Docker and R:

- Docker, R, And Reproducibility – Colin Fay
- R Docker Tutorial – rOpenSci Labs
- Docker and Packrat – Joel Nitta

`@niszet` talked about reproducible reporting with R Markdown. He was certainly the right person to give a talk on this topic, as he is the author of the mini-books “Create a Word document with R Markdown” and “Create a PowerPoint presentation with R Markdown”. To begin, he talked about cases where one writes a document, report, or any kind of output and how it might be a good idea to be able to create it again for “future me” or anybody else that might want to take a look. Normally, you would run into problems such as “where did my data go?” or “how did I pre-process my data?”, but you can mitigate these problems by using R Markdown reports. Whether it’s importing your raw data, the pre-processing/modelling/viz code, or the actual report and documentation write-up, R Markdown renders the document in a completely clean environment each time, so the results should be the same, every time! As noted in the beginning, you can create many different kinds of output such as Word documents, PowerPoint presentations, HTML documents, presentation slides, and more – even business cards (with the pagedown package)! Following an explanation of what you can do with R Markdown, `@niszet` talked about **how** exactly one would use it to its best capabilities. In a live demo he took us through all the different components of an R Markdown document:

- YAML header: Where one defines the “how” of the Rmd, such as the title, template, output directory, output type, etc.
- Code chunks and options: Where all the code you want to run (which can be in languages besides R) should be located. Chunk options, such as whether to evaluate the code within, whether to show the code, and many more, are also specified here.
- Markdown text: Regular text using markdown. Can also include inline code using backticks.
- Various buttons/shortcut keys: such as `Ctrl + Shift + K` to knit!
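Putting those components together, a minimal R Markdown document looks something like this (a generic skeleton of my own, not taken from the talk):

````markdown
---
title: "My Reproducible Report"
output: word_document
---

Some regular markdown text, with inline code like `r mean(mtcars$mpg)`.

```{r summary-chunk, echo = TRUE}
# A code chunk; chunk options above control evaluation and display
summary(mtcars$mpg)
```
````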

Some other intros to R Markdown:

- R Markdown: The Definitive Guide – Yihui Xie, JJ Allaire, Garrett Grolemund
- R Markdown Cheat Sheet – RStudio
- R Markdown 入門 – Kazuhiro Maeda (in Japanese)

It’s only been 3 (three!) days since `@GotaMorishita` started learning R, yet here he was giving an LT on finding a new place to live in Tokyo using R! Tired of living with his parents, `@GotaMorishita` decided to live by himself and started thinking about ways to use R and machine learning to search for a place with an optimal price and location. After web-scraping a housing website and pre-processing the data, he came across a problem: if he split the data into a training and test set for selecting the best predictive model, then he would be throwing away a considerable number of possible candidates for housing.

If `@GotaMorishita` took out 90% of the houses/apartments from the training data and kept those as candidates for the test data, it would’ve meant that the training data would have a markedly different distribution compared to the test data set, and the model created from the training set wouldn’t be as accurate. This problem, called covariate shift, is when the training data and test data have different distributions but the conditional distribution of the output values given the input data is unchanged. Using standard methods of model selection, such as cross-validation or AIC, in this situation leads to bias. However, this problem can be mitigated by weighting the loss function by importance (the ratio of test to training input densities). You can find a more detailed description in the research papers below. `@GotaMorishita` used `xgboost` to implement his models, one with importance weighting and another without, and used group-wise cross-validation to tune the hyperparameters. The results are shown below, comparing the overall test scores for all Tokyo districts (left) and just the Sangenjaya district (right): the RMSE was smaller when importance weighting was used.

Some more info on covariate shift and importance weighting:

- Sugiyama, M., Krauledat, M., & Mueller, K. R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May), 985-1005.
- Sugiyama, M. (2012). Learning under non-stationarity: Covariate shift adaptation by importance weighting. In Handbook of Computational Statistics (pp. 927-952). Springer, Berlin, Heidelberg.
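The importance-weighting idea can be sketched in a few lines of base R: estimate w(x) = p_test(x) / p_train(x) and use it as a case weight when fitting. This is a toy illustration with a deliberately misspecified linear model under assumed normal densities, not the speaker’s xgboost code:

```r
set.seed(1)

# Training inputs concentrated near 0; "test" inputs shifted toward 1.5
x_train <- rnorm(500, mean = 0)
y_train <- sin(x_train) + rnorm(500, sd = 0.1)
x_test  <- rnorm(500, mean = 1.5)

# Importance weights w(x) = p_test(x) / p_train(x), densities assumed normal
w <- dnorm(x_train, mean = 1.5) / dnorm(x_train, mean = 0)

# Unweighted fit vs. importance-weighted fit of the same linear model
fit_plain <- lm(y_train ~ x_train)
fit_iw    <- lm(y_train ~ x_train, weights = w)

# The weighted fit emphasizes the region where the test density lives,
# so it should predict better there
y_test <- sin(x_test)
rmse <- function(fit) {
  sqrt(mean((predict(fit, data.frame(x_train = x_test)) - y_test)^2))
}
c(plain = rmse(fit_plain), weighted = rmse(fit_iw))
```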

`@sk_bono36` gave a presentation on using R for marketing purposes with the `rtweet` package. In marketing there is a concept called a “persona”, which is a blueprint of what a certain type of person in the target audience for your product/service is like. A basic persona template can consist of their name, job title, demographic details, values, and hobbies. You create these ideal customers through careful market research involving both social media analytics and interviewing/surveying the customers themselves. `@sk_bono36` focused on creating a persona (with “自転車/Bicycle” as the keyword for this case study) by using `rtweet`, then running Japanese text analysis with `RMeCab`. Visualizing the data with word clouds and network graphs of bi-grams, he was able to find a lot of information on Twitter users who have “bicycle” in their profile or tweets and extract the common characteristics of this type of person.

As a result of his analysis `@sk_bono36` was able to create a persona of people who like bicycles:

- 20~30 years old
- Owns a road bike
- Friendly disposition
- Likes anime/video games
- Does weight lifting

Some other intros to the `rtweet` package:

- rtweet workshop – Mike Kearney (author of rtweet)
- 21 Recipes for Mining Twitter Data with rtweet – Bob Rudis

`@igjit`, who also presented at Japan.R back in December on building an R compiler with R, gave a talk about another recent project of his: a package that acts as a type-checking system for R. A problem that he finds in R is that errors relating to having the wrong data type in the arguments of R functions are only found when code is executed, not before. Frustrated by this, `@igjit` created his own package called typrr that type checks your code! The underlying checker that `typrr` runs is Flycheck, which is a syntax-checking extension for Emacs. For now, the package only checks for the basic data types found in R (integer, double, logical, and character) and it can only check functions with one argument. He rounded off the talk by saying that he created this package just for fun, and he advised all the beginneRs in the audience that people learn from **doing** rather than just **reading**, so to truly get better it’s best to go out and experiment!

- ao053934144: R in the Healthcare Industry
- hana_orin: Using Hierarchical Linear Modeling to predict housing prices
- AsaKiriSun: Predicting sample size from rbinom(n=10, size, prob)

Following all of the talks, those who were staying for the after-party were served sushi and drinks! With a loud rendition of “kampai!” (cheers!) R users from all over Tokyo began to talk about their successes and struggles with R. A fun tradition at `TokyoR` is a **Rock-Paper-Scissors** tournament with the prize being **free** data science books (I still haven’t won any, nor even gotten close to the last rounds)!

The prizes for this month were:

- A couple of copies of the “Create a Word document with R Markdown” mini-book by niszet.
- 3 copies of the Japanese translation (by Hoxo-m Inc.) of “Feature Engineering for Machine Learning” by Alice Zheng and Amanda Casari, provided by uribo.

`TokyoR` happens almost monthly and it’s a great way to mingle with Japanese R users, as it’s the largest regular meetup here in Japan. Talks in English are also welcome, so if you’re ever in Tokyo come join us!

To **leave a comment** for the author, please follow the link and comment on their blog: ** R by R(yo)**.


(This article was first published on ** R – Statistical Odds & Ends**, and kindly contributed to R-bloggers)

The NBA playoffs are in full swing! A total of 16 teams are competing in a playoff-format competition, with the winner of each best-of-7 series moving on to the next round. In each matchup, two teams play 7 basketball games against each other, and the team that wins more games progresses. Of course, we often don’t have to play all 7 games: we can determine the overall winner once either team reaches 4 wins.

During one of the games, a commentator made a remark along the lines of * “you have no idea how hard it is for the better team to lose in a best-of-7 series”*. In this post, I’m going to try and quantify how hard it is to do so! I will do this not via analytical methods, but by Monte Carlo simulation. (You can see the code all at once here.)

Let’s imagine an idealized set-up where, for any given game, team A has probability p of beating team B. This probability is assumed to be the same from game to game, and the outcome of each game is assumed to be independent of the others. With these assumptions, the number of games that team A wins in a 7-game series has a binomial distribution Binom(7, p).

In our simulation, for each value of p, we will generate a large number of random values from Binom(7, p) and determine the proportion of them that are ≥ 4: that is our estimate for the winning probability of the 7-game series. *How many random values should we draw?* We should draw enough so that we are reasonably confident of the accuracy of the proportion estimate. If we draw N samples and the estimated proportion is p̂, then the plug-in estimate of standard error is sqrt(p̂(1 − p̂)/N). In what follows, we will plot estimates with error bars indicating ±1 standard error.

First, let’s set up a function that takes in three parameters: the number of simulation runs (`simN`), the total number of games in a series (`n`), and the probability that team A wins a single game (`p`). The function returns the proportion of the `simN` runs which team A wins.

```r
sim_fn <- function(simN, n, p) {
  if (n %% 2 == 0) {
    stop("n should be an odd number")
  }
  mean(rbinom(simN, n, p) > n / 2)
}
```

Next, we set up a data frame with each row representing a different win probability for team A. Because of the symmetry inherent in the problem, we focus on win probabilities which are ≥ 0.5.

```r
n <- 7
df <- data.frame(p = seq(0.5, 1, length.out = 26), n = n)
```

For a start, we set `simN` to be 100. For each row, we run `simN` simulation runs to get the win probability estimate and compute the associated standard error estimate. We also compute the values one standard error above and below the probability estimate, to be used when plotting error bars.

```r
set.seed(1043)
simN <- 100
df$win_prob <- apply(df, 1, function(x) sim_fn(simN, x[2], x[1]))

# compute std err & p_hat \pm 1 std err
df$std_err <- with(df, sqrt(win_prob * (1 - win_prob) / simN))
df$lower <- with(df, win_prob - std_err)
df$upper <- with(df, win_prob + std_err)
```

Here is the code for the plot and the plot itself:

```r
library(ggplot2)
ggplot(df, aes(x = p, y = win_prob)) +
  geom_line() +
  geom_linerange(aes(ymin = lower, ymax = upper)) +
  labs(title = paste("Win probability for best of", n, "series"),
       x = "Win prob for single game", y = "Win prob for series")
```

We can see that the line is still quite wiggly with large error bars, suggesting that 100 simulation runs is not enough. (Indeed, we expect the graph to be monotonically increasing.) Below, we run 100,000 simulations for each value of `p` instead (with the same random seed):

We get a much smoother line, and the error bars are so small we can hardly see them. My conclusion here is that while it is harder for the weaker team to win a best-of-7 series than a single game, the odds are not insurmountable. For example, a team that has a 70% chance of winning any one game, which is a huge advantage, still has about a 13% chance of losing a best-of-7 series: not insignificant!
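That 13% figure can be sanity-checked exactly: the series-win probability is just an upper-tail binomial probability, so no simulation is needed (`series_win` is a helper name of my own):

```r
# P(win a best-of-n series) = P(Binom(n, p) >= (n+1)/2), computed exactly
series_win <- function(p, n = 7) 1 - pbinom(n %/% 2, n, p)

series_win(0.7)       # about 0.874, i.e. about a 12.6% chance of losing
1 - series_win(0.7)
```

This agrees well with the simulated curve above.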

How do these probabilities change if we consider best-of-n series for different values of `n`? The code below is very similar to that above, except that our data frame contains data for `n` equal to all the odd numbers from 1 to 15, not just `n = 7`.

```r
p <- seq(0.5, 1, length.out = 26)
n <- seq(1, 15, length.out = 8)
df <- expand.grid(p, n)
names(df) <- c("p", "n")

set.seed(422)
simN <- 100000
df$win_prob <- apply(df, 1, function(x) sim_fn(simN, x[2], x[1]))

# compute std err & p_hat \pm 1 std err
df$std_err <- with(df, sqrt(win_prob * (1 - win_prob) / simN))
df$lower <- with(df, win_prob - std_err)
df$upper <- with(df, win_prob + std_err)

ggplot(df, aes(x = p, y = win_prob, col = factor(n))) +
  geom_line() +
  geom_linerange(aes(ymin = lower, ymax = upper)) +
  labs(title = "Win probability for best of n series",
       x = "Win prob for single game", y = "Win prob for series") +
  theme(legend.position = "bottom")
```

Here is the plot:

As we expect, the series win probabilities increase as `n` increases. Also, the `n = 1` graph corresponds to the line y = x. It looks like the win probabilities don’t change much from best-of-9 to best-of-15.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Statistical Odds & Ends**.


(This article was first published on ** R – r4stats.com**, and kindly contributed to R-bloggers)

Now that I’ve completed seven detailed reviews of Graphical User Interfaces (GUIs) for R, let’s compare them. It’s easy enough to count their features and plot them, so let’s start there. I’m basing the counts on the number of menu items in each category. That’s not too hard to get, but it’s far from perfect. Some software has fewer menu choices, depending instead on dialog box choices. Studying every menu and dialog box would be too time-consuming, so be aware of this limitation. I’m putting the details of each measure in the appendix so you can adjust the figures and create your own graphs. If you decide to make your own graphs, I’d love to hear from you in the comments below.

Figure 1 shows the number of analytic methods each software supports on the x-axis and the number of graphics methods on the y-axis. The analytic methods count combines statistical features, machine learning / artificial intelligence ones (ML/AI), and the ability to create R model objects. The graphics features count totals up the number of bar charts, scatterplots, etc. each package can create.

The ideal place to be in this graph is in the upper right corner. We see that BlueSky and R Commander offer quite a lot of both analytic and graphical features. Rattle stands out as having the second greatest number of graphics features. JASP is the lowest on graphics features and 3rd from the bottom on analytic ones.

Next, let’s swap out the y-axis for general usability features. These consist of a variety of features that make your work easier, including data management capabilities (see appendix for details).

Figure 2 shows that BlueSky and R Commander are still in the top two positions overall, but now Deducer has nearly caught up with R Commander on the number of general features. That’s due to its reasonably strong set of data management tools, plus its output is in true word processing tables, saving you the trouble of formatting it yourself. Rattle is much lower in this plot since, while its graphics capabilities are strong (at least in relation to ML/AI tasks), it has minimal data management capabilities.

These plots help show us three main overall feature sets, but each package offers things that the others don’t. Let’s look at a brief overview of each. Remember that each of these has a detailed review that follows my standard template. I’ll start with the two that have come out on top, then follow in alphabetical order.

**The R Commander** – This is the oldest GUI, having been around since at least 2005. There are an impressive 41 plug-ins developed for it. It is currently the only R GUI that saves R Markdown files, but it does not create word processing tables by default, as some of the others do. The R code it writes is classic, rarely using the newer tidyverse functions. It works as a partner to R; you install R separately, then use it to install and start R Commander. It makes it easy to blend menu-based analysis with coding. If your goal is to learn to code in classic R, this is an excellent choice.

**BlueSky Statistics** – This software was created by former SPSS employees and it shares many of SPSS’ features. BlueSky is only a few years old, and it converted from commercial to open source just a few months ago. Although BlueSky and R Commander offer many of the same features, they do them in different ways. When using BlueSky, it’s not initially apparent that R is involved at all. Unless you click the “Syntax” button that every dialog box has, you’ll never see the R code or the code editor. Its output is in publication-quality tables which follow the popular style of the American Psychological Association.

**Deducer** – This has a very nice-looking interface, and it’s probably the first to offer true word processing tables by default. Being able to just cut and paste a table into your word processor saves a lot of time and it’s a feature that has been copied by several others. Deducer was released in 2008, and when I first saw it, I thought it would quickly gain developers. It got a few, but development seems to have halted. Deducer’s installation is quite complex, and it depends on the troublesome Java software. It also used JGR, which never became as popular as the similar RStudio. The main developer, Ian Fellows, has moved on to another very interesting GUI project called Vivid.

**jamovi** – The developers who form the core of the jamovi project used to be part of the JASP team. Despite the fact that they started a couple of years later, they’re ahead of JASP in several ways at the moment. Its developers decided that the R code it used should be visible and any R code should be executable, something that differentiated it from JASP. jamovi has an extremely interactive interface that shows you the result of every selection in each dialog box. It also saves the settings in every dialog box, and lets you re-use every step on a new dataset by saving a “template.” That’s extremely useful since GUI users often don’t want to learn R code. jamovi’s biggest weakness is its dearth of data management tasks, though there are plans to address that.

**JASP** – The biggest advantage JASP offers is its emphasis on Bayesian analysis. If that’s your preference, this might be the one for you. At the moment JASP is very different from all the other GUIs reviewed here because it won’t show you the R code it’s writing, and you can’t execute your own R code from within it. Plus the software has not been open to outside developers. The development team plans to address those issues, and their deep pockets should give them an edge.

**Rattle** – If your work involves ML/AI (a.k.a. data mining) instead of standard statistical methods, Rattle may be the best GUI for you. It’s focused on ML/AI, and its tab-based interface makes quick work of it. However, it’s the weakest of them all when it comes to statistical analysis. It also lacks many standard data management features. The only other GUI that offers many ML/AI features is BlueSky.

**RKWard** – This GUI blends a nice point-and-click interface with an integrated development environment that is the most advanced of all the GUIs reviewed here. It’s easy to install and start, and it saves all your dialog box settings, allowing you to rerun them. However, that’s done step-by-step, not all at once as jamovi’s templates allow. The code RKWard creates is classic R, with no tidyverse at all.

I hope this brief comparison will help you choose the R GUI that is right for you. Each offers unique features that can make life easier for non-programmers. If one catches your eye, don’t forget to read the full review of it here.

Writing this set of reviews has been a monumental undertaking. It would not have been possible without the assistance of Bruno Boutin, Anil Dabral, Ian Fellows, John Fox, Thomas Friedrichsmeier, Rachel Ladd, Jonathan Love, Ruben Ortiz, Christina Peterson, Josh Price, Eric-Jan Wagenmakers, and Graham Williams.

In figures 1 and 2, *Analytic Features* adds up statistics, machine learning / artificial intelligence, the ability to create R model objects, and the ability to validate models using techniques such as k-fold cross-validation. *Graphics Features* is the sum of two rows: the number of graphs the software can create, plus one point for small multiples (facets) if it can do them. *Usability* is everything else, with each row worth 1 point, except where noted.

| Feature | Definition |
|---|---|
| Simple installation | Is it done in one step? |
| Simple start-up | Does it start on its own without starting R, loading packages, etc.? |
| Import Data Files | How many file types can it import? |
| Import Database | How many databases can it read from? |
| Export Data Files | How many file formats can it write to? |
| Data Editor | Does it have a data editor? |
| Can work on >1 file | Can it work on more than one file at a time? |
| Variable View | Does it show metadata in a variable view, allowing for many fast edits to metadata? |
| Data Management | How many data management tasks can it do? |
| Transform Many | Can it transform many variables at once? |
| Types | How many graph types does it have? |
| Small Multiples | Can it show small multiples (facets)? |
| Model Objects | Can it create R model objects? |
| Statistics | How many statistical methods does it have? |
| ML/AI | How many ML / AI methods does it have? |
| Model Validation | Does it offer model validation (k-fold, etc.)? |
| R Code IDE | Can you edit and execute R code? |
| GUI Reuse | Does it let you re-use work without code? |
| Code Reuse | Does it let you rerun all using code? |
| Package Management | Does it manage packages for you? |
| Table of Contents | Does output have a table of contents? |
| Re-order | Can you re-order output? |
| Publication Quality | Is output in publication quality by default? |
| R Markdown | Can it create R Markdown? |
| Add comments | Can you add comments to output? |
| Group-by | Does it do group-by repetition of any other task? |
| Output as Input | Does it save equivalent to broom’s tidy, glance, augment? (They earn 1 point for each) |
| Developer tools | Does it offer developer tools? |

| Feature | BlueSky | Deducer | JASP | jamovi | Rattle | Rcmdr | RKWard |
|---|---|---|---|---|---|---|---|
| Simple installation | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
| Simple start-up | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| Import Data Files | 7 | 13 | 4 | 5 | 9 | 7 | 5 |
| Import Database | 5 | 0 | 0 | 0 | 1 | 0 | 0 |
| Export Data Files | 5 | 7 | 1 | 4 | 1 | 3 | 3 |
| Data Editor | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| Can work on >1 file | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Variable View | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Data Management | 30 | 9 | 2 | 3 | 9 | 25 | 4 |
| Transform Many | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| Types | 25 | 16 | 9 | 12 | 24 | 21 | 14 |
| Small Multiples | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| Model Objects | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
| Statistics | 96 | 37 | 26 | 44 | 8 | 95 | 22 |
| ML/AI | 9 | 0 | 0 | 0 | 12 | 0 | 0 |
| Model Validation | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| R Code IDE | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| GUI Reuse | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| Code Reuse | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| Package Management | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| Output: Table of Contents | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Output: Re-order | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Output: Publication Quality | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Output: R Markdown | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Output: Add comments | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| Group-by / Split File | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Output as Input | 3 | 1 | 0 | 0 | 0 | 1 | 0 |
| Developer tools | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| **Total** | **196** | **94** | **48** | **77** | **67** | **160** | **56** |

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – r4stats.com**.


(This article was first published on ** R – Xi'an's Og**, and kindly contributed to R-bloggers)

**A** rather blah number Le Monde mathematical puzzle:

Find all integer multiples of 11111 with exactly one occurrence of each decimal digit.

Which I solved by brute force, by looking at the possible range of multiples (and borrowing `stringr::str_count` from Robin!)

```r
> combien = 0
> for (i in 90001:900008) {
    j = i * 11111
    combien = combien + (min(stringr::str_count(j, paste(0:9))) == 1)
  }
> combien
[1] 3456
```

And a bonus one:

Find all integers y that can be written both as x³ and as (10z)³ + a with 1 ≤ a ≤ 999.

which does not offer much in terms of solutions since, writing v = 10z, x³−v³ = (x−v)(x²+xv+v²) = a shows that x² is less than 2a/3, meaning x is at most 25. Among such numbers only x = 11, 12 lead to a solution, as x³ = 1331, 1728.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Xi'an's Og**.


(This article was first published on ** R/exams**, and kindly contributed to R-bloggers)

Experiences with using R/exams for written exams in finance classes with a moderate number of students at Texas A&M International University (TAMIU).

Guest post by Nathaniel P. Graham (Texas A&M International University, Division of International Banking and Finance Studies).

While R/exams was originally written for large statistics classes, there is nothing specific to either large courses or statistics. In this article, I will describe how I use R/exams for my “Introduction to Finance (FIN 3310)” course. While occasionally I might have a class of 60 students, I generally have smaller sections of 20–25. These courses are taught and exams are given in-person (some material and homework assignments are delivered via the online learning management system, but I do not currently use R/exams to generate that).

My written exams are generated by `exams2nops()` and have two components: a multiple-choice portion and a short-answer portion. The former are generated using R/exams’s dynamic exercise format while the latter are included from static PDF documents. Example pages from a typical exam are linked below:

Because I have a moderate number of students, I grade all my exams manually. This blog post outlines how the exercises are set up and which R scripts I use to automate the generation of the PDF exams.

The actual questions in my exams are not different from many other R/exams examples, except that they are about finance instead of statistics. An example is included below with LaTeX markup, and both the R/LaTeX (.Rnw) and R/Markdown (.Rmd) versions can be downloaded: goalfinance.Rnw, goalfinance.Rmd. I have 7 possible answers, but only 5 will be randomly chosen for a given exam (always including the 1 correct answer). To do so, many of my non-numerical questions take advantage of the `exshuffle` option here. (If you are one of my students, I *told* you this would be on the exam!)

Periodic updates to individual questions (stored as separate files) keep the exams fresh.

```
\begin{question}
The primary goal of financial management is to maximize the \dots
\begin{answerlist}
\item Present value of future operating cash flows
\item Net income, according to GAAP/IFRS
\item Market share
\item Value of the firm
\item Number of shares outstanding
\item Book value of shareholder equity
\item Firm revenue
\end{answerlist}
\end{question}
\exname{Goal of Financial Management}
\extype{schoice}
\exsolution{0001000}
\exshuffle{5}
```

Since this is for a finance class, there are numerical exercises as well. Every semester, I promise my students that this problem *will* be on the second midterm. The problem is simple: given some cash flows and a discount rate, calculate the net present value (NPV), see npvcalc.Rnw, npvcalc.Rmd. Since the cash flows, the discount rate, and the answers are generated from random numbers, I would not gain much from defining more than five possible answers, the way I did in the qualitative question above. As you might expect, some of the incorrect answers are common mistakes students make when approaching NPV, so I have not lost anything relative to the carefully crafted questions and answers I used in the past to find out who really learned the material.

```
<<>>=
discountrate <- round(runif(1, min = 6.0, max = 15.0), 2)
r <- discountrate / 100.0
cf0 <- sample(10:20, 1) * -100
ocf <- sample(seq(200, 500, 25), 5)
discounts <- sapply(1:5, function(i) (1 + r) ** i)
npv <- round(sum(ocf / discounts) + cf0, 2)
notvm <- round(sum(ocf) + cf0, 2)
wrongtvm <- round(sum(ocf / (1.0 + r)) + cf0, 2)
revtvm <- round(sum(ocf * (1.0 + r)) + cf0, 2)
offnpv <- round(npv + sample(c(-200.0, 200.0), 1), 2)
@
\begin{question}
Assuming the discount rate is \Sexpr{discountrate}\%, find the
net present value of a project with the following cash flows, starting
at time 0: \$\Sexpr{cf0}, \Sexpr{ocf[1]}, \Sexpr{ocf[2]}, \Sexpr{ocf[3]},
\Sexpr{ocf[4]}, \Sexpr{ocf[5]}.
\begin{answerlist}
\item \$\Sexpr{wrongtvm}
\item \$\Sexpr{notvm}
\item \$\Sexpr{npv}
\item \$\Sexpr{revtvm}
\item \$\Sexpr{offnpv}
\end{answerlist}
\end{question}
\exname{Calculating NPV}
\extype{schoice}
\exsolution{00100}
\exshuffle{5}
```

While there are other methods of generating randomized exams out there, few are as flexible as R/exams, in large part because we have full access to R. The next example also appears on my second midterm; it asks students to compute the internal rate of return for a set of cash flows, see irrcalc.Rnw, irrcalc.Rmd. Numerically, this is a simple root-finding problem, but systems that do not support more advanced mathematical operations (such as Blackboard’s “Calculated Formula” questions) can make this difficult or impossible to implement directly.

```
<<>>=
cf0 <- sample(10:16, 1) * -100
ocf <- sample(seq(225, 550, 25), 5)
npvfunc <- function(r) {
  discounts <- sapply(1:5, function(i) (1 + r) ** i)
  npv <- (sum(ocf / discounts) + cf0) ** 2
  return(npv)
}
res <- optimize(npvfunc, interval = c(-1, 1))
irr <- round(res$minimum, 4) * 100.0
wrong1 <- irr + sample(c(1.0, -1.0), 1)
wrong2 <- irr + sample(c(0.25, -0.25), 1)
wrong3 <- irr + sample(c(0.5, -0.5), 1)
wrong4 <- irr + sample(c(0.75, -0.75), 1)
@
\begin{question}
Find the internal rate of return of a project with the following cash flows,
starting at time 0: \$\Sexpr{cf0}, \Sexpr{ocf[1]}, \Sexpr{ocf[2]},
\Sexpr{ocf[3]}, \Sexpr{ocf[4]}, \Sexpr{ocf[5]}.
\begin{answerlist}
\item \Sexpr{wrong1}\%
\item \Sexpr{wrong2}\%
\item \Sexpr{irr}\%
\item \Sexpr{wrong3}\%
\item \Sexpr{wrong4}\%
\end{answerlist}
\end{question}
\exname{Calculating IRR}
\extype{schoice}
\exsolution{00100}
\exshuffle{5}
```

While `exams2nops()` has some support for open-ended questions that are scored manually, it is very limited. I create and format those questions in the traditional manner (Word or LaTeX) and save the resulting file as a PDF. Alternatively, a custom `exams2pdf()` call could be used for this. Finally, I add this PDF onto the end of the exam using the `pages` option of `exams2nops()` (as illustrated in more detail below). As an example, the short-answer questions from the final exam above are given in the following PDF file:

To facilitate automatic inclusion of these short-answer questions, a naming convention is adopted in the scripts below. The script looks for a file named `examNsa.pdf`, where `N` is the exam number (1 for the first midterm exam, 2 for the second, and 3 for the final exam), and sets `pages` to it. So for the final exam, it looks for the file `exam3sa.pdf`. If I want to update my short-answer questions, I just replace the old PDF with a new one.

I keep the questions for each exam in their own directory, which makes the script below, which I use to generate exams 1 and 2, straightforward. For the moment, the script uses all the questions in an exam's directory, but if I want to guarantee that I have exactly (for example) 20 multiple-choice questions, I could set `nsamp = 20` instead of however many Rnw (or Rmd) files it finds.

Note that `exlist` is a list with one element (a vector of file names), so that R/exams will randomize the order of the questions as well. If I wanted to specify the order, I could make `exlist` a list of N elements, where each element was exactly one file/question. The `nsamp` option could then be left unset (the default is `NULL`).
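The difference between the two layouts can be sketched as follows (the question file names here are hypothetical placeholders):

```r
# randomized order: one element holding the whole pool of files;
# nsamp draws from that pool and R/exams shuffles the question order
exlist_random <- list(c("q1.Rnw", "q2.Rnw", "q3.Rnw"))

# fixed order: one element per question, so each exam slot is
# deterministic; nsamp is then left at its default, NULL
exlist_fixed <- list("q1.Rnw", "q2.Rnw", "q3.Rnw")
```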

When I generate a new set of exams, I only need to update the first four lines, and the script does the rest. Note that I use a modified version of the English `en.dcf` language configuration file in order to adapt some terms to TAMIU's terminology, e.g., "ID Number" instead of the `exams2nops()` default "Registration Number". See en_tamiu.dcf for the details. Since the TAMIU student ID numbers have 8 digits, I use the `reglength = 8` argument, which sets the number of digits in the ID Number to 8.

```
library("exams")
## configuration: change these to make a new batch of exams
exid <- 2
exnum <- 19
exdate <- "2019-04-04"
exsemester <- "SP19"
excourse <- "FIN 3310"
## options derived from configuration above
exnam <- paste0("fin3310exam", exid)
extitle <- paste(excourse, exsemester, "Exam", exid, sep = " ")
saquestion <- paste0("SA_questions/exam", exid, "sa.pdf")
## exercise directory (edir) and output directory (odir)
exedir <- paste0("fin3310_exam", exid, "_exercises")
exodir <- "nops"
if(!dir.exists(exodir)) dir.create(exodir) ## in case it was previously deleted
exodir <- paste0(exodir, "/exam", exid, "_", format(Sys.Date(), "%Y-%m-%d"))
## exercise list: one element with a vector of all available file names
exlist <- list(dir(path = exedir, pattern = "\\.Rnw$"))
## generate exam
set.seed(54321) # not the actual random seed
exams2nops(file = exlist,
  n = exnum, nsamp = length(exlist[[1]]), dir = exodir, edir = exedir,
  name = exnam, date = exdate, course = excourse, title = extitle,
  institution = "Texas A\\&M International University",
  language = "en_tamiu.dcf", encoding = "UTF-8",
  pages = saquestion, blank = 1, logo = NULL, reglength = 8,
  samepage = TRUE)
```

My final exam is comprehensive, so I would like to include questions from the previous exams. I do not want to keep separate copies of those questions just for the final, in case I update one version and forget to update the other, so I need a script that gathers up questions from the first two midterms and adds in some questions specific to the final.

The script below uses a feature that has long been available in R/exams but was undocumented up to version 2.3-2: you can set `edir` to a directory, and all its subdirectories will be included in the search path. I have specified some required questions that I want to appear on every student's exam; each student will also get a random draw of other questions from the first two midterms, in addition to some questions that only appear on the final. How many of each is controlled by `exnsamp`, which is passed to the `nsamp` argument. They add up to 40 currently, so `exams2nops()`'s 45-question limit does not affect me.

```
library("exams")
## configuration: change these to make a new batch of exams
exnum <- 19
exdate <- "2019-04-30"
exsemester <- "SP19"
excourse <- "FIN 3310"
## options derived from configuration above
extitle <- paste(excourse, exsemester, "Exam", "3", sep = " ")
## exercise directory (edir) and output directory (odir)
exedir <- getwd()
exodir <- "nops"
if(!dir.exists(exodir)) dir.create(exodir)
exodir <- paste0(exodir, "/exam3", "_", format(Sys.Date(), "%Y-%m-%d"))
## exercises: required and from previous midterms
exrequired <- c("goalfinance.Rnw", "pvcalc.Rnw", "cashcoverageortie.Rnw",
"ocfcalc.Rnw", "discountratecalc.Rnw", "npvcalc.Rnw", "npvcalc2.Rnw",
"stockpriceisabouttopaycalc.Rnw", "annuitycalc.Rnw", "irrcalc.Rnw")
exlist1 <- dir(path = "fin3310_exam1_exercises", pattern = "\\.Rnw$")
exlist2 <- dir(path = "fin3310_exam2_exercises", pattern = "\\.Rnw$")
exlist3 <- dir(path = "fin3310_exam3_exercises", pattern = "\\.Rnw$")
## final list and corresponding number to be sampled
exlist <- list(
  exrequired,
  setdiff(exlist1, exrequired),
  setdiff(exlist2, exrequired),
  exlist3
)
exnsamp <- c(10, 5, 10, 15)
## generate exam
set.seed(54321) # not the actual random seed
exams2nops(file = exlist,
  n = exnum, nsamp = exnsamp, dir = exodir, edir = exedir,
  name = "fin3310exam3", date = exdate, course = excourse, title = extitle,
  institution = "Texas A\\&M International University",
  language = "en_tamiu.dcf", encoding = "UTF-8",
  pages = "SA_questions/exam3sa.pdf", blank = 1, logo = NULL,
  reglength = 8, samepage = TRUE)
```

R/exams can read scans of the answer sheet, automating grading for the multiple-choice questions. For a variety of reasons, I do not use any of those features. If I had to teach larger classes, I would doubtless find a way to make it convenient, but for the foreseeable future I will continue to use the function below to produce the answer key for each exam, which I grade by hand. This is not especially onerous, since I have to grade the short-answer questions by hand anyway.

Given the serialized R data file (.rds) produced by `exams2nops()`, the corresponding `exams_metainfo()` can be extracted. This comes with a `print()` method that displays all correct answers for a given exam. Starting from R/exams version 2.3-3 (the current R-Forge devel version), it is also possible to split the output into blocks of five for easy reading (matching the blocks on the answer sheets). As an example, the first few correct answers of the third exam are extracted:

```
fin3310exam3 <- readRDS("fin3310exam3.rds")
print(exams_metainfo(fin3310exam3), which = 3, block = 5)
```

```
## 19043000003
## 1. Discount Rate: 1
## 2. Calculating IRR: 4
## 3. Present Value: 3
## 4. Calculating stock prices: 5
## 5. Calculating NPV: 3
##
## 6. Annuity PV: 3
## 7. Goal of Financial Management: 1
## 8. Finding OCF: 5
## ...
```

It can be convenient to display the correct answers in a customized format, e.g., with letters instead of numbers and omitting the exercise title text. To do so, the code below sets up a function `exam_solutions()`, applying it to the same exam as above.

```
exam_solutions <- function(examdata, examnum) {
  solutions <- LETTERS[sapply(examdata[[examnum]],
    function(x) which(x$metainfo$solution))]
  data.frame(Solution = solutions)
}
split(exam_solutions(fin3310exam3, examnum = 3), rep(1:8, each = 5))
```

```
## $`1`
## Solution
## 1 A
## 2 D
## 3 C
## 4 E
## 5 C
##
## $`2`
## Solution
## 6 C
## 7 A
## 8 E
## ...
```

The only downside of the manual grading approach is that I do not have students’ responses in an electronic format for easy statistical analysis, but otherwise grading is very fast.

Using the system above, R/exams works well even for small courses with in-person, paper exams. I do not need to worry about copies of my exams getting out and students memorizing them, or any of the many other ways students have found to cheat over the years. By doing all the formatting work for me, R/exams helps me avoid a lot of the finicky aspects of adding, adjusting, or removing questions from an existing exam, and generally keeps exam construction focused on the important part: writing questions that gauge students' progress.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R/exams**.

(This article was first published on ** R – Win-Vector Blog**, and kindly contributed to R-bloggers)

I thought I would give a personal update on our book: *Practical Data Science with R* 2nd edition; Zumel, Mount; Manning 2019.

The second edition should be fully available this fall! Nina and I have finished up through chapter 10 (of 12), and Manning has released previews up through chapter 7 (with more to come!).

At this point the second edition is very real, which makes me feel a bit conflicted about the 1st edition. If you need a good guide to data science in R *in paper form right now*, please consider buying the 1st edition. If you are happy with an e-book, please consider buying the 2nd edition MEAP (Manning Early Access Program) now: it gets you previews of the new book, a complete e-copy of the 1st edition, and a complete e-copy of the 2nd edition when that is done! And we expect the fully printed version of the 2nd edition to be available this fall.

The 1st edition remains an excellent book that we are very proud of. I have 12 more copies of it (that I paid for) in my closet for gifting. Just four days ago I achieved a personal 1st edition goal: handing a print copy of the 1st edition to Robert Gentleman in person!

But, the 1st edition’s time is coming to an end.

The 1st edition has been with Nina and me since we started work on it in November of 2012 (the 1st edition came out in April 2014). Earlier drafts were titled "Successful Data Science in R", and issues around the book title made me so nervous that the entire book production directory was named "DataScienceBook" until last year, about 4 years after the book's publication!

Well, now for me *Practical Data Science with R*, 2nd edition is “the book.” Nina and I have put a lot of time into revising and improving the book for the new edition. In fact we started the 2nd edition March of 2018, so we have been working on this new edition for quite some time. If you wonder why you have not seen Nina for the last year, this has been the major reason (she is the leader on both editions of the book).

*Practical Data Science with R* 2nd edition; Zumel, Mount; Manning 2019 is intended as an improved revision of the 1st edition. We fixed things, streamlined things, incorporated new packages, and updated the content to catch up with over 5 years of evolution of the R ecosystem. We also had the chance to add some new content on boosted trees, model explanation, and data wrangling (including data.table!).

On the technical side, we have a free support site for the book: https://github.com/WinVector/PDSwR2. As with the first edition, we share all code examples, so you don't have to copy and paste from the e-copy. This is fairly substantial, as *Practical Data Science with R* 2nd edition; Zumel, Mount; Manning 2019 is an example-driven book. We have also just released R Markdown re-renderings of all of the examples from the book. These confirm the examples run correctly, and they are where you can look if you are hung up trying to run an example (which will almost always be a matter of adding the path from where you have the code to the data, or a small matter of installing packages).

If you have, and liked, the 1st edition, please consider buying the 2nd edition: it is a substantial improvement. It is the same book, but a better version of it. If you want a book that teaches you how to work as a data scientist using R as the example, please buy *Practical Data Science with R* 2nd edition; Zumel, Mount; Manning 2019.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Win-Vector Blog**.

(This article was first published on ** R – AriLamstein.com**, and kindly contributed to R-bloggers)

Today I am happy to announce a new free course: *Help Your Team Learn R*!

Over the last few years I’ve helped a number of data teams train their analysts to use R. At each company there was a skilled R user who was leading the team’s effort to adopt R.

Each of these internal R advocates, however, had a similar problem: they did not have a background in training. So while they knew that R could help their company, and they were R experts themselves, they struggled to get their team to actually learn R.

*Help Your Team Learn R* is designed to help people in this situation by providing an introduction to some of the key concepts in the field of technical training.

In terms of pedagogy, the course focuses on *Criterion-Referenced Instruction*, the branch of training that I have found most useful when teaching R.

The course contains seven lessons that are sent via email over the course of six days. Each email contains:

- One key concept from the field of training
- An example of how I have applied that concept when I teach R
- An exercise to help you apply that concept to your own teaching

If you are looking for help in teaching R to your team, then I encourage you to sign up for *Help Your Team Learn R* using the form below!

The post Free Course: Help Your Team Learn R! appeared first on AriLamstein.com.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – AriLamstein.com**.

(This article was first published on ** Vijay Barve**, and kindly contributed to R-bloggers)

Biodiversity citizen scientists use iNaturalist to post their observations with photographs. The observations are then curated there by crowd-sourcing the identifications and other trait-related aspects as well. Once the data is promoted to "research grade", it is passed on to GBIF as occurrence records.

Exciting news from India in the 3rd week of April 2019:

Another important milestone in #Biodiversity Citizen Science in #India. This week we crossed 100K verifiable records on @inaturalist this data is about ~10K species by 4500+ observers #CitSci pic.twitter.com/DCF3QxQl1i

— Vijay Barve (@vijaybarve) April 21, 2019

Being interested in biodiversity data visualizations and completeness, I was waiting for 100k records to explore the data. Here is what I did and found out.

**Step 1:** Download the data from iNaturalist website. Which can be done very easily by visiting the website and choosing the right options.

https://www.inaturalist.org/observations?place_id=6681

I downloaded the file as .zip and extracted the observations-xxxx.csv. [In my case it was observations-51008.csv].

**Step 2:** Read the data file in R

```
library(readr)
observations_51008 <- read_csv("input/observations-51008.csv")
```

**Step 3:** Clean up the data and set it up to be used in package bdvis.

```
library(bdvis)
inatc <- list(
  Latitude = "latitude",
  Longitude = "longitude",
  Date_collected = "observed_on",
  Scientific_name = "scientific_name"
)
inat <- format_bdvis(observations_51008, config = inatc)
```

**Step 4:** We still need to rename some more columns for ease of use in visualizations; for example, rather than 'taxon_family_name' it is easier to have a field called 'Family'.

```
rename_column <- function(dat, old, new){
  if(old %in% colnames(dat)){
    colnames(dat)[which(names(dat) == old)] <- new
  } else {
    print(paste("Error: Fieldname not found...", old))
  }
  return(dat)
}
inat <- rename_column(inat, 'taxon_kingdom_name', 'Kingdom')
inat <- rename_column(inat, 'taxon_phylum_name', 'Phylum')
inat <- rename_column(inat, 'taxon_class_name', 'Class')
inat <- rename_column(inat, 'taxon_order_name', 'Order_')
inat <- rename_column(inat, 'taxon_family_name', 'Family')
inat <- rename_column(inat, 'taxon_genus_name', 'Genus')
# Remove records in excess of 100k
inat <- inat[1:100000, ]
```

**Step 5:** Make sure the data is loaded properly

```
bdsummary(inat)
```

will produce something like this:

```
Total no of records = 100000
Temporal coverage...
  Date range of the records from 1898-01-01 to 2019-04-19
Taxonomic coverage...
  No of Families : 1345
  No of Genus : 5638
  No of Species : 13377
Spatial coverage ...
  Bounding box of records 6.806092 , 68.532 - 35.0614769085 , 97.050133
  Degree cells covered : 336
  % degree cells covered : 39.9524375743163
```

The data looks good, but we have a small issue: there are some records from the year 1898, which might cause trouble with some of our visualizations. So let us drop records from before the year 2000 for the time being.

```
inat = inat[which(inat$Date_collected > "2000-01-01"), ]
```

Now we are ready to explore the data. The first thing I always like to see is the geographical coverage of the data. First let us try it at 1 degree (~100km) grid cells. Note that here I have an Admin2.shp file with the map of Indian states.

```
mapgrid(inat, ptype = "records", bbox = c(60, 100, 5, 40), shp = "Admin2.shp")
```

This shows fairly good geographical coverage of the data at this scale; we have very few degree cells with no data. How about a finer scale, though? Say a 0.1 degree (~10km) grid. Let us generate that.

```
mapgrid(inat, ptype = "records", bbox = c(60, 100, 5, 40), shp = "Admin2.shp",
        gridscale = 0.1)
```

Now the pattern of where the data is coming from is clear.

To be continued…

**References**

- Barve, Vijay, and Javier Otegui. 2016. “Bdvis: Visualizing Biodiversity Data in R.” Bioinformatics. doi:http://dx.doi.org/10.1093/bioinformatics/btw333.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Vijay Barve**.

(This article was first published on ** R Views**, and kindly contributed to R-bloggers)

Great data science work should be reproducible. The ability to repeat experiments is part of the foundation for all science, and reproducible work is also critical for business applications. Team collaboration, project validation, and sustainable products presuppose the ability to reproduce work over time.

In my opinion, mastering just a handful of important tools will make reproducible work in R much easier for data scientists. R users should be familiar with version control, RStudio projects, and literate programming through R Markdown. Once these tools are mastered, the major remaining challenge is creating a reproducible environment.

An environment consists of all the dependencies required to enable your code to run correctly. This includes R itself, R packages, and system dependencies. As with many programming languages, it can be challenging to manage reproducible R environments. Common issues include:

- Code that used to run no longer runs, even though the code has not changed.
- Being afraid to upgrade or install a new package, because it might break your code or someone else's.
- Typing `install.packages` in your environment doesn't do anything, or doesn't do the *right* thing.

These challenges can be addressed through a careful combination of tools and strategies. This post describes two use cases for reproducible environments:

- Safely upgrading packages
- Collaborating on a team

The sections below each cover a strategy to address the use case, and the necessary tools to implement each strategy. Additional use cases, strategies, and tools are presented at https://environments.rstudio.com. This website is a work in progress, but we look forward to your feedback.

Upgrading packages can be a risky affair. It is not difficult to find serious R users who have been in a situation where upgrading a package had unintended consequences. For example, the upgrade may have broken parts of their current code, or upgrading a package for one project accidentally broke the code in another project. A strategy for safely upgrading packages consists of three steps:

- Isolate a project
- Record the current dependencies
- Upgrade packages

The first step in this strategy ensures one project's packages and upgrades won't interfere with any other projects. Isolating projects is accomplished by creating per-project libraries. A tool that makes this easy is the new `renv` package. Inside of your R project, simply use:

```
# inside the project directory
renv::init()
```

The second step is to record the current dependencies. This step is critical because it creates a safety net. If the package upgrade goes poorly, you'll be able to revert the changes and return to the record of the working state. Again, the `renv` package makes this process easy.

```
# record the current dependencies in a file called renv.lock
renv::snapshot()
# commit the lockfile alongside your code in version control
# and use this function to view the history of your lockfile
renv::history()
# if an upgrade goes astray, revert the lockfile
renv::revert(commit = "abc123")
# and restore the previous environment
renv::restore()
```

With an isolated project and a safety net in place, you can now proceed to upgrade or add new packages, while remaining certain the current functional environment is still reproducible. The `pak` package can be used to install and upgrade packages in an interactive environment:

```
# upgrade packages quickly and safely
pak::pkg_install("ggplot2")
```

The safety net provided by the `renv` package relies on access to older versions of R packages. For public packages, CRAN provides these older versions in the CRAN archive. Organizations can use tools like RStudio Package Manager to make multiple versions of private packages available. The "snapshot and restore" approach can also be used to promote content to production. In fact, this approach is exactly how RStudio Connect and shinyapps.io deploy thousands of R applications to production each day!

A common challenge on teams is sharing and running code. One strategy that administrators and R users can adopt to facilitate collaboration is shared baselines. The basics of the strategy are simple:

- Administrators set up a common environment for R users by installing RStudio Server.
- On the server, administrators install multiple versions of R.
- Each version of R is tied to a frozen repository using an Rprofile.site file.

By using a frozen repository, either administrators or users can install packages while still being sure that everyone will get the same set of packages. A frozen repository also ensures that adding new packages won't upgrade other shared packages as a side effect. New packages and upgrades are offered to users over time through the addition of new versions of R.

Frozen repositories can be created by manually cloning CRAN, accessing a service like MRAN, or utilizing a supported product like RStudio Package Manager.
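As a minimal sketch of the third bullet above (the snapshot URL is a hypothetical placeholder; the exact path depends on the repository product in use), an `Rprofile.site` can pin an R installation to a frozen repository by overriding the default `repos` option:

```r
# Rprofile.site for one installed version of R:
# every session started from this installation resolves packages
# against a repository frozen on a fixed date
local({
  options(repos = c(CRAN = "https://packages.example.com/cran/2019-04-26"))
})
```

Because `Rprofile.site` runs at startup for every session of that installation, `install.packages()` then draws from the frozen snapshot by default, so all users of that version of R see the same package set.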

The prior sections presented specific strategies for creating reproducible environments in two common cases. The same strategy may not be appropriate for every organization, R user, or situation. If you're a student reporting an error to your professor, capturing your `sessionInfo()` may be all you need. In contrast, a statistician working on a clinical trial will need a robust framework for recreating their environment. **Reproducibility is not binary!**

To help pick between strategies, we've developed a strategy map. By answering two questions, you can quickly identify where your team falls on this map and identify the nearest successful strategy. The two questions are represented on the x- and y-axes of the map:

- Do I have any restrictions on what packages can be used?
- Who is responsible for managing installed packages?

For more information on picking and using these strategies, please visit https://environments.rstudio.com. By adopting a strategy for reproducible environments, R users, administrators, and teams can solve a number of important challenges. Ultimately, reproducible work adds credibility, creating a solid foundation for research, business applications, and production systems. We are excited to be working on tools to make reproducible work in R easy and fun. We look forward to your feedback, community discussions, and future posts.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R Views**.

(This article was first published on ** R – Xi'an's Og**, and kindly contributed to R-bloggers)

**A** neat question from The Riddler on a multi-probability survival rate:

Nine processes are running in a loop with fixed survival rates .99, …, .91. What is the probability that the first process is the last one to die? Same question with probabilities .91, …, .99 and the probability that the last process is the last one to die.

The first question means that the realisation of a Geometric G(.99) has to be strictly larger than the largest of eight Geometrics G(.98), …, G(.91). Given that the cdf of a Geometric G(a) is P(X ≤ n) = 1 − aⁿ [when counting the number of attempts till failure, included, i.e., the Geometric with support on the positive integers],

the probability that this happens has the nice (?!) representation

which leads to an easy resolution by recursion since

and

and a value of 0.5207 returned by R (Monte Carlo evaluation of 0.5207 based on 10⁷ replications). The second question is quite similar, with solution

and value 0.52596 (Monte Carlo evaluation of 0.52581 based on 10⁷ replications).
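Both values are easy to sanity-check by simulation. A sketch (not the post's own code), under the reading that the first process must strictly outlive the rest, while in the reversed ordering the last process also wins ties, since same-round casualties die before it:

```r
set.seed(2019)
n <- 1e5
rates <- seq(0.99, 0.91, by = -0.01)  # per-round survival rates
# lifetime = rounds survived plus the fatal round; rgeom() counts the
# survivals (successes) before the first death (failure)
lives <- sapply(rates, function(a) rgeom(n, prob = 1 - a) + 1)
# Q1: the .99 process, first in the loop, strictly outlives the rest
p1 <- mean(lives[, 1] > apply(lives[, -1], 1, max))
# Q2: rates reversed, so the .99 process is last in the loop and wins ties
p2 <- mean(lives[, 1] >= apply(lives[, -1], 1, max))
c(p1, p2)  # both roughly 0.52-0.53, in line with the values above
```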

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Xi'an's Og**.

(This article was first published on ** S+/R – Yet Another Blog in Statistical Computing**, and kindly contributed to R-bloggers)

After working on the MOB package, I received requests from multiple users asking whether I could write a binning function that takes a weighting scheme into consideration. It is a legitimate request from a practical standpoint. For instance, in the development of fraud detection models, we often sample down non-fraud cases, given the extremely low frequency of fraud instances. After the down-sampling, a weight value > 1 should be assigned to all non-fraud cases to reflect the fraud rate in the pre-sampling data.

While accommodating the request for weighting cases is trivial, I'd like to run a simple experiment showing what the impact of weighting might be.

– First of all, let’s apply the monotonic binning to a variable named “tot_derog”. In this unweighted binning output, KS = 18.94, IV = 0.21, and WoE values range from -0.38 to 0.64.

– In the first trial, a weight value = 5 is assigned to cases with Y = 0 and a weight value = 1 assigned to cases with Y = 1. As expected, frequency, distribution, bad_frequency, and bad_rate changed. However, KS, IV, and WoE remain identical.

– In the second trial, a weight value = 1 is assigned to cases with Y = 0 and a weight value = 5 assigned to cases with Y = 1. Once again, KS, IV, and WoE are still the same as the unweighted output.

The conclusion from this demonstration is very clear. In the case of two-value weights assigned to the binary Y, the variable importance reflected by IV / KS and the WoE values remain identical with or without weights. However, if you are concerned about the binning distribution and the bad rate in each bin, the function wts_bin() makes the correction and is available in the project repository (https://github.com/statcompute/MonotonicBinning).
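The invariance is easy to verify directly: a constant weight on each class rescales both the numerator and denominator of that class's bin shares, and WoE depends only on those shares. A minimal sketch with made-up bin counts:

```r
# toy bin counts of good (Y = 0) and bad (Y = 1) cases
good <- c(400, 300, 200)
bad  <- c(20, 40, 90)

# weight of evidence per bin: log of (bad share / good share)
woe <- function(good, bad) log((bad / sum(bad)) / (good / sum(good)))

# weighting goods by 5 and bads by 1 leaves every WoE value unchanged
all.equal(woe(good, bad), woe(good * 5, bad * 1))  # TRUE
```

The same cancellation applies to IV and KS, which are built from the same class shares, so only the bin-level frequencies and bad rates move under weighting.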

To **leave a comment** for the author, please follow the link and comment on their blog: ** S+/R – Yet Another Blog in Statistical Computing**.
