(This article was first published on ** R – insightR**, and kindly contributed to R-bloggers)

In a recent post I talked about inference after model selection, showing that a simple double-selection procedure is enough to solve the problem. In this post I'm going to talk about a generalization of double selection to any Machine Learning (ML) method, described by Chernozhukov et al. (2016).

Suppose you are in the following framework:

y = θd + g(z) + u
d = m(z) + v

where θ is the parameter of interest, z is a set of control variables and u and v are error terms. The functions g and m are very generic and possibly non-linear.

How can we estimate θ? The most naive way (and obviously wrong) is to simply estimate a regression using only d to explain y. Another way is to try something similar to the double selection, which is referred to by Chernozhukov et al. (2016) as Double Machine Learning (DML):

- (1): Estimate ghat(z) from a regression of y on z,
- (2): Estimate mhat(z) from a regression of d on z (i.e. without using y) and compute the residuals vhat = d - mhat(z),
- (3): Estimate theta_hat = mean[vhat (y - ghat(z))] / mean[vhat d].

However, in this case the procedure above will leave a term in the asymptotic distribution of theta_hat that will cause bias. This problem may be solved with sample splitting. Suppose we randomly split our observations into two samples of size N/2. The first sample will be indexed by I and the auxiliary second sample will be indexed by I^C. We are going to estimate the first two steps above in the auxiliary sample and the third step will be done in sample I. Therefore:

theta_hat(I, I^C) = mean_I[vhat (y - ghat(z))] / mean_I[vhat d]

where ghat and vhat come from models estimated on I^C and the means are taken over the observations in I.

The estimator above is unbiased. However, you are probably wondering about efficiency loss, because theta_hat was estimated using only half of the sample. To solve this problem we must now do the opposite: first we estimate steps 1 and 2 in the sample I and then we estimate the third step in the sample I^C. As a result, we will have theta_hat(I, I^C) and theta_hat(I^C, I). The cross-fitting estimator will be:

theta_cf = [theta_hat(I, I^C) + theta_hat(I^C, I)] / 2

which is a simple average of the two theta_hats.

The best way to illustrate this topic is using simulation. I am going to generate data from the model y = θd + g(z) + u, d = m(z) + v, where:

- The number of observations N and the number of simulations will both be 500,
- The number of variables in z will be k = 10, generated from a multivariate normal distribution,
- θ = 0.5 and b = (1, 1/2, …, 1/k),
- g(z) = cos(z'b)^2 and m(z) = sin(z'b) + cos(z'b),
- u and v are normal with mean 0 and variance 1.

```
library(clusterGeneration)
library(mvtnorm)
library(randomForest)

set.seed(123) # = Seed for Replication = #
N = 500 # = Number of observations = #
k = 10  # = Number of variables in z = #
theta = 0.5
b = 1/(1:k)

# = Generate covariance matrix of z = #
sigma = genPositiveDefMat(k, "unifcorrmat")$Sigma
sigma = cov2cor(sigma)
```

The ML model we are going to use to estimate steps 1 and 2 is the Random Forest. The simulation will estimate the simple OLS regression of y on d, the naive DML without sample splitting, and the cross-fitting DML. The 500 simulations may take a few minutes.

```
set.seed(123)
M = 500 # = Number of Simulations = #

# = Matrix to store results = #
thetahat = matrix(NA, M, 3)
colnames(thetahat) = c("OLS", "Naive DML", "Cross-fitting DML")

for(i in 1:M){
  z = rmvnorm(N, sigma = sigma)              # = Generate z = #
  g = as.vector(cos(z %*% b)^2)              # = Generate the function g = #
  m = as.vector(sin(z %*% b) + cos(z %*% b)) # = Generate the function m = #
  d = m + rnorm(N)                           # = Generate d = #
  y = theta*d + g + rnorm(N)                 # = Generate y = #

  # = OLS estimate = #
  OLS = coef(lm(y ~ d))[2]
  thetahat[i, 1] = OLS

  # = Naive DML = #
  # = Compute ghat = #
  model = randomForest(z, y, maxnodes = 20)
  G = predict(model, z)
  # = Compute mhat (stored as Mhat so we do not overwrite M, the number of simulations) = #
  modeld = randomForest(z, d, maxnodes = 20)
  Mhat = predict(modeld, z)
  # = Compute vhat as the residuals of the second model = #
  V = d - Mhat
  # = Compute DML theta = #
  theta_nv = mean(V*(y - G))/mean(V*d)
  thetahat[i, 2] = theta_nv

  # = Cross-fitting DML = #
  # = Split sample = #
  I = sort(sample(1:N, N/2))
  IC = setdiff(1:N, I)
  # = Compute ghat on both samples = #
  model1 = randomForest(z[IC, ], y[IC], maxnodes = 10)
  model2 = randomForest(z[I, ], y[I], maxnodes = 10)
  G1 = predict(model1, z[I, ])
  G2 = predict(model2, z[IC, ])
  # = Compute mhat and vhat on both samples = #
  modeld1 = randomForest(z[IC, ], d[IC], maxnodes = 10)
  modeld2 = randomForest(z[I, ], d[I], maxnodes = 10)
  M1 = predict(modeld1, z[I, ])
  M2 = predict(modeld2, z[IC, ])
  V1 = d[I] - M1
  V2 = d[IC] - M2
  # = Compute Cross-Fitting DML theta = #
  theta1 = mean(V1*(y[I] - G1))/mean(V1*d[I])
  theta2 = mean(V2*(y[IC] - G2))/mean(V2*d[IC])
  theta_cf = mean(c(theta1, theta2))
  thetahat[i, 3] = theta_cf
}

colMeans(thetahat) # = check the average theta for all models = #
```

```
##               OLS         Naive DML Cross-fitting DML 
##         0.5465718         0.4155583         0.5065751 
```

```
# = plot distributions = #
plot(density(thetahat[, 1]), xlim = c(0.3, 0.7), ylim = c(0, 14))
lines(density(thetahat[, 2]), col = 2)
lines(density(thetahat[, 3]), col = 4)
abline(v = 0.5, lty = 2, col = 3)
legend("topleft", legend = c("OLS", "Naive DML", "Cross-fitting DML"),
       col = c(1, 2, 4), lty = 1, cex = 0.7, seg.len = 0.7, bty = "n")
```

As you can see, the only unbiased distribution is the cross-fitting DML. However, the model used to estimate the functions g and m must be able to capture the relevant information in the data. In the linear case you may use the LASSO and achieve a result similar to the double selection. Finally, what we did here was a 2-fold cross-fitting, but you can also do a K-fold cross-fitting, just like you do a K-fold cross-validation. The size of K does not matter asymptotically, but in small samples the results may change.
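The K-fold generalisation can be sketched as follows. This is my own minimal illustration: the helper name `dml_crossfit` is made up, and `lm` stands in for the ML learner so the sketch stays self-contained; in practice you would plug in the Random Forest (or any other ML method) exactly as in the loop above.

```r
# Sketch of K-fold cross-fitting DML (hypothetical helper; lm as a
# stand-in learner so the example is self-contained).
dml_crossfit <- function(y, d, z, K = 5){
  n <- length(y)
  folds <- sample(rep(1:K, length.out = n)) # random fold assignment
  thetas <- numeric(K)
  zt <- as.data.frame(z)
  for(k in 1:K){
    test  <- which(folds == k)  # fold used for the final (third) step
    train <- which(folds != k)  # remaining folds estimate ghat and mhat
    gfit <- lm(y ~ ., data = cbind(y = y, zt)[train, ])
    mfit <- lm(d ~ ., data = cbind(d = d, zt)[train, ])
    G <- predict(gfit, zt[test, ])           # ghat on the held-out fold
    V <- d[test] - predict(mfit, zt[test, ]) # vhat on the held-out fold
    thetas[k] <- mean(V * (y[test] - G)) / mean(V * d[test])
  }
  mean(thetas) # average the K fold-wise estimates, as in the 2-fold case
}
```

For a linear data-generating process this recovers θ; with non-linear g and m you would swap `lm` for the Random Forest calls used above.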

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., & Hansen, C. (2016). Double machine learning for treatment and causal parameters. arXiv preprint arXiv:1608.00060.


(This article was first published on ** DataCamp Blog**, and kindly contributed to R-bloggers)

Hello R users! We have a big announcement this week as well: DataCamp just released its first Spark in R course, Introduction to Spark in R using sparklyr! This course is taught by Richie Cotton.

R is mostly optimized to help you write data analysis code quickly and readably. Apache Spark is designed to analyze huge datasets quickly. The `sparklyr` package lets you write `dplyr` R code that runs on a Spark cluster, giving you the best of both worlds. This course teaches you how to manipulate Spark DataFrames using both the `dplyr` interface and the native interface to Spark, as well as trying out machine learning techniques. Throughout the course, you'll explore the Million Song Dataset.

Introduction to Spark in R using sparklyr features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will teach you how to work with Spark in R!

**What you’ll learn:**

Starting off you will learn how Spark and R complement each other, how to get data to and from Spark, and how to manipulate Spark data frames using dplyr syntax.

In the second chapter, you will learn more about using the `dplyr` interface to Spark, including advanced field selection, calculating groupwise statistics, and joining data frames.

In chapter 3, you’ll learn about Spark’s machine learning data transformation features and functionality for manipulating native DataFrames.

**Chapter 4 – Case Study: Learning to be a Machine: Running Machine Learning Models on Spark**

The final chapter is a case study in which you learn to use `sparklyr`'s machine learning routines, by predicting the year in which a song was released.


(This article was first published on ** Rstats – OUseful.Info, the blog…**, and kindly contributed to R-bloggers)

In advance of the recent UK general election, ODI Leeds published an interactive hexmap of constituencies to provide a navigation surface over various datasets relating to Westminster constituencies:

As well as the interactive front end, ODI Leeds published a simple JSON format for sharing the hex data, *hexjson*, that allows you to specify an identifier for each hex, some data values associated with it, and relative row (`r`) and column (`q`) co-ordinates:
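For illustration, a minimal *hexjson* document of this kind might look like the snippet below. The identifiers, names and co-ordinates are invented for the example, and `layout` names the hex-grid convention the renderer should use:

```json
{
  "layout": "odd-r",
  "hexes": {
    "C001": { "n": "Constituency A", "q": 3, "r": -11 },
    "C002": { "n": "Constituency B", "q": 0, "r": 1 }
  }
}
```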

It's not hard to imagine the publication of default, or **base**, *hexJSON* documents that include standard identifier codes and appropriate co-ordinates (e.g. for Westminster constituencies, wards, local authorities, and so on) being developed around such a standard.

So that’s one thing the standard affords – a common way of representing lots of different datasets.

Tooling can then be developed to inject particular data values into an appropriate `hexJSON` file. For example, a `hexJSON` representation of UK HEIs could add a data attribute identifying whether an HEI received a Gold, Silver or Bronze TEF rating. That's a second thing the availability of a standard supports.

By building a UI that reads data in from a *hexJSON* file, ODI Leeds have developed an application that can presumably render other people’s *hexJSON* files, again, another benefit of a standard representation.

But the availability of the standard also means other people can build other visualisation tools around it. Which is exactly what Oli Hawkins did with his `d3-hexjson` Javascript library, *“a D3 module for generating [SVG] hexmaps from data in HexJSON format”*, as announced here. So that's another thing the standard allows.

You can see an example here, created by Henry Lau:

You maybe start to get a feel for how this works… Data in a standard form, standard library that renders the data. For example, Giuseppe Sollazzo (aka @puntofisso), had a play looking at voter swing:

So… one of the things I was wondering was how easy it would be for folk in the House of Commons Library, for example, to make use of the `d3-hexjson` maps without having to do the Javascript or HTML thing.

Step in HTMLwidgets, a (standardised) format for publishing interactive HTML widgets from Rmarkdown (Rmd). The idea is that you should be able to say something like:

`hexjsonwidget( hexjson )`

and embed a rendering of a `d3-hexjson` map in HTML output from a *knitr*ed *Rmd* document.

(Creating the `hexjson` as a JSON file from a base (*hexJSON*) file with custom data values added to it is the next step, and the next thing on my to do list.)

So following the HTMLwidgets tutorial, and copying Henry Lau’s example (which maybe drew on Oli’s README?) I came up with a minimal take on a hexJSON HTMLwidget.

```
library(devtools)
install_github("psychemedia/htmlwidget-hexjson")
```

It’s little more than a wrapping of the demo template, and I’ve only tested it with a single example *hexJSON* file, but it does generate d3.js hexmaps:

```
library(jsonlite)
library(hexjsonwidget)
jj = fromJSON('./example.hexjson')
hexjsonwidget(jj)
```

It also needs documenting, and support for creating data-populated base *hexJSON* files. But it's a start. And another thing the *hexJSON* standard has unintentionally provided support for.

But it does let you create HTML documents with embedded hexmaps if you have the *hexJSON* file handy:

By the by, it's also worth noting that we can also publish an image snapshot of the SVG hexjson map in a `knitr` rendering of the document to a PDF or Microsoft Word output document format:

At first I thought this wasn’t possible, and via @timelyportfolio found a workaround to generate an image from the SVG:

```
library(exportwidget)
library(webshot)
library(htmltools)
library(magrittr)

html_print(tagList(hexjsonwidget(jj),
                   export_widget()),
           viewer = NULL) %>% webshot(delay = 3)
```

But then noticed that the PDF rendering *was* suddenly working: it seems that if you have the `webshot` and `htmltools` packages installed, the PDF and Word rendering of the HTMLwidget SVG as an image works automagically. (I'm not sure I've seen that documented; the related HTMLwidget Github issue still looks to be open?)


(This article was first published on ** blog**, and kindly contributed to R-bloggers)

Mango Solutions are excited to announce that the venue and dates have now been confirmed for the third EARL Boston Conference. Delegates will enjoy a new vibe this year, with the Conference being held in the heart of Cambridge at **The Charles Hotel on 1-3 November**.

EARL is a cross-sector Conference for commercial R users. The popular EARL format returns to Boston with one day of R workshops, two days devoted to the most innovative R implementations by the world’s leading practitioners, plus an evening networking event.

Abstract submissions for the two days of presentations are now open. Mango Solutions are accepting abstracts that focus on the commercial applications of R. This is your opportunity to share your R stories, successes and innovations with R users from other sectors and companies.

*Accepted speakers receive a day pass for the day they are presenting and a ticket to the networking reception.*

Submit your abstract online by 31 August: earlconf.com/abstracts

**Have a clear commercial focus**

Ideally, your abstract will cover a business problem that you’ve solved with R. A good format is: what the problem is/was, the approach you took and the results.

**Use an example**

Really drive your abstract home with a brief example. It helps to illustrate your problem and results.

**Create a catchy title**

Let’s be honest, quite often conference delegates will decide which presentations they want to attend by the title.

**Think about the audience**

Your fellow R users will be from a range of sectors, companies and user levels and while you can’t appeal to everyone, do think about yourself sitting in the audience and what you’d like to hear from a presenter.

**Get someone to read over it**

This one seems obvious, but a second (or third) set of eyes can help you see where there might be gaps or too much information.

Important note: Keep in mind that presentations are 30 minutes, including time for questions.

*For inspiration take a look at previous speaker abstracts from the previous EARL Conferences: San Francisco 2017, London 2017, Boston 2016, London 2016.*

**Submit your abstract online by 31 August: earlconf.com/abstracts**


(This article was first published on ** R-exercises**, and kindly contributed to R-bloggers)

This exercise set provides (further) practice in writing Excel documents using the `xlsx` package, as well as importing and general data manipulation. Specifically, we have loops in order for you to practice scaling. A previous exercise set focused on writing a simple sheet with the same package; see here.

We will use a subset of commuting data from the Dublin area from AIRO and the 2011 Irish census.

Solutions are available here.

**Exercise 1**

Load the `xlsx` package. If necessary, install it as indicated in the previous xlsx exercise set.

**Exercise 2**

Download the data to your computer and read it into your R workspace as `commuting` using `read.xlsx2()` or the slower alternative `read.xlsx()`. Use `colClasses` to set relevant classes, as we will be manipulating the data later on.

**Exercise 3**

Clean the data a bit by removing `'Population_Aged_5_Over_By_'` and `'To_Work_School_College_'` from the column names.

**Exercise 4**

Sum the `'population aged 5 and over'` variables by electoral division name, using for instance `aggregate()` or `data.table`, and save the result as `commuting_ed`.

**Exercise 5**

Create an `xlsx` workbook object in your R workspace and call it `wb`.

**Exercise 6**

Create three sheet objects named `sheet1, sheet2, sheet3` in `wb` and in your workspace. Use a loop.

**Exercise 7**

Make a `data.frame` that lists the proportion of respondents in each of the following categories by electoral division: travel on foot, travel on bicycle, leave home before 6:30.

**Exercise 8**

Add the top 5 electoral divisions in each category to the previously created sheets with all the proportions, using a loop. Leave the first row free.

**Exercise 9**

Add a nice title to the first row of each sheet and apply some style to it.

**Exercise 10**

Save your workbook to your working directory and open it using Excel. Go back to `R` and continue formatting and adding information to your workbook at will.


(This article was first published on ** Easy Guides**, and kindly contributed to R-bloggers)

This article describes how to easily create basic and ordered **bar plots** using ggplot2-based helper functions available in the ggpubr R package. We'll also present some modern alternatives to bar plots, including **lollipop charts** and **Cleveland's dot plots**.

Note that the approach to building a bar plot using standard ggplot2 verbs has been described in our previous article, available at: ggplot2 barplots : Quick start guide.

You might also be interested in the following articles:

**Contents**:

You need to install the R package ggpubr (version >= 0.1.3) to easily create ggplot2-based publication-ready plots.

Install from CRAN:

`install.packages("ggpubr")`

Or, install the latest development version from GitHub as follows:

```
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
```

Load ggpubr:

`library(ggpubr)`

Create a demo data set:

```
df <- data.frame(dose = c("D0.5", "D1", "D2"),
                 len = c(4.2, 10, 29.5))
print(df)
```

```
  dose  len
1 D0.5  4.2
2   D1 10.0
3   D2 29.5
```

Basic bar plots:

```
# Basic bar plots with label
p <- ggbarplot(df, x = "dose", y = "len",
color = "black", fill = "lightgray")
p
# Rotate to create horizontal bar plots
p + rotate()
```

Change fill and outline colors by groups:

```
ggbarplot(df, x = "dose", y = "len",
fill = "dose", color = "dose", palette = "jco")
```

Create a demo data set:

```
df2 <- data.frame(supp=rep(c("VC", "OJ"), each=3),
dose=rep(c("D0.5", "D1", "D2"),2),
len=c(6.8, 15, 33, 4.2, 10, 29.5))
print(df2)
```

```
  supp dose  len
1   VC D0.5  6.8
2   VC   D1 15.0
3   VC   D2 33.0
4   OJ D0.5  4.2
5   OJ   D1 10.0
6   OJ   D2 29.5
```

Plot y = “len” by x = “dose” and change color by a second group: “supp”

```
# Stacked bar plots, add labels inside bars
ggbarplot(df2, x = "dose", y = "len",
fill = "supp", color = "supp",
palette = c("gray", "black"),
label = TRUE, lab.col = "white", lab.pos = "in")
# Change position: Interleaved (dodged) bar plot
ggbarplot(df2, x = "dose", y = "len",
fill = "supp", color = "supp",
palette = c("gray", "black"),
position = position_dodge(0.9))
```

Load and prepare data:

```
# Load data
data("mtcars")
dfm <- mtcars
# Convert the cyl variable to a factor
dfm$cyl <- as.factor(dfm$cyl)
# Add the name columns
dfm$name <- rownames(dfm)
# Inspect the data
head(dfm[, c("name", "wt", "mpg", "cyl")])
```

```
                               name    wt  mpg cyl
Mazda RX4                 Mazda RX4 2.620 21.0   6
Mazda RX4 Wag         Mazda RX4 Wag 2.875 21.0   6
Datsun 710               Datsun 710 2.320 22.8   4
Hornet 4 Drive       Hornet 4 Drive 3.215 21.4   6
Hornet Sportabout Hornet Sportabout 3.440 18.7   8
Valiant                     Valiant 3.460 18.1   6
```

Create ordered bar plots. Change the fill color by the grouping variable “cyl”. Sorting will be done globally, but not by groups.

```
ggbarplot(dfm, x = "name", y = "mpg",
          fill = "cyl",           # change fill color by cyl
          color = "white",        # Set bar border colors to white
          palette = "jco",        # jco journal color palette. see ?ggpar
          sort.val = "desc",      # Sort the value in descending order
          sort.by.groups = FALSE, # Don't sort inside each group
          x.text.angle = 90       # Rotate x axis text vertically
          )
```

Sort bars inside each group. Use the argument **sort.by.groups = TRUE**.

```
ggbarplot(dfm, x = "name", y = "mpg",
          fill = "cyl",           # change fill color by cyl
          color = "white",        # Set bar border colors to white
          palette = "jco",        # jco journal color palette. see ?ggpar
          sort.val = "asc",       # Sort the value in ascending order
          sort.by.groups = TRUE,  # Sort inside each group
          x.text.angle = 90       # Rotate x axis text vertically
          )
```

The deviation graph shows the deviation of quantitative values from a reference value. In the R code below, we'll plot the mpg z-score from the mtcars data set.

Calculate the z-score of the mpg data:

```
# Calculate the z-score of the mpg data
dfm$mpg_z <- (dfm$mpg -mean(dfm$mpg))/sd(dfm$mpg)
dfm$mpg_grp <- factor(ifelse(dfm$mpg_z < 0, "low", "high"),
levels = c("low", "high"))
# Inspect the data
head(dfm[, c("name", "wt", "mpg", "mpg_z", "mpg_grp", "cyl")])
```

```
                               name    wt  mpg      mpg_z mpg_grp cyl
Mazda RX4                 Mazda RX4 2.620 21.0  0.1508848    high   6
Mazda RX4 Wag         Mazda RX4 Wag 2.875 21.0  0.1508848    high   6
Datsun 710               Datsun 710 2.320 22.8  0.4495434    high   4
Hornet 4 Drive       Hornet 4 Drive 3.215 21.4  0.2172534    high   6
Hornet Sportabout Hornet Sportabout 3.440 18.7 -0.2307345     low   8
Valiant                     Valiant 3.460 18.1 -0.3302874     low   6
```

Create an ordered bar plot, colored according to the level of mpg:

```
ggbarplot(dfm, x = "name", y = "mpg_z",
          fill = "mpg_grp",       # change fill color by mpg_grp
          color = "white",        # Set bar border colors to white
          palette = "jco",        # jco journal color palette. see ?ggpar
          sort.val = "asc",       # Sort the value in ascending order
          sort.by.groups = FALSE, # Don't sort inside each group
          x.text.angle = 90,      # Rotate x axis text vertically
          ylab = "MPG z-score",
          xlab = FALSE,
          legend.title = "MPG Group"
          )
```

Rotate the plot: use rotate = TRUE and sort.val = “desc”

```
ggbarplot(dfm, x = "name", y = "mpg_z",
          fill = "mpg_grp",       # change fill color by mpg_grp
          color = "white",        # Set bar border colors to white
          palette = "jco",        # jco journal color palette. see ?ggpar
          sort.val = "desc",      # Sort the value in descending order
          sort.by.groups = FALSE, # Don't sort inside each group
          x.text.angle = 90,      # Rotate x axis text vertically
          ylab = "MPG z-score",
          legend.title = "MPG Group",
          rotate = TRUE,
          ggtheme = theme_minimal()
          )
```

The lollipop chart is an alternative to bar plots when you have a large number of values to visualize.

Lollipop chart colored by the grouping variable “cyl”:

```
ggdotchart(dfm, x = "name", y = "mpg",
           color = "cyl",                                # Color by groups
           palette = c("#00AFBB", "#E7B800", "#FC4E07"), # Custom color palette
           sorting = "ascending",                        # Sort value in ascending order
           add = "segments",                             # Add segments from y = 0 to dots
           ggtheme = theme_pubr()                        # ggplot2 theme
           )
```

- Sort in descending order: **sorting = "descending"**.
- Rotate the plot vertically, using **rotate = TRUE**.
- Sort the mpg value inside each group by using **group = "cyl"**.
- Set **dot.size** to 6.
- Add mpg values as labels: **label = "mpg"** or **label = round(dfm$mpg)**.

```
ggdotchart(dfm, x = "name", y = "mpg",
color = "cyl", # Color by groups
palette = c("#00AFBB", "#E7B800", "#FC4E07"), # Custom color palette
sorting = "descending", # Sort value in descending order
add = "segments", # Add segments from y = 0 to dots
rotate = TRUE, # Rotate vertically
group = "cyl", # Order by groups
dot.size = 6, # Large dot size
label = round(dfm$mpg), # Add mpg values as dot labels
font.label = list(color = "white", size = 9,
vjust = 0.5), # Adjust label parameters
ggtheme = theme_pubr() # ggplot2 theme
)
```

Deviation graph:

- Use y = “mpg_z”
- Change segment color and size: add.params = list(color = “lightgray”, size = 2)

```
ggdotchart(dfm, x = "name", y = "mpg_z",
color = "cyl", # Color by groups
palette = c("#00AFBB", "#E7B800", "#FC4E07"), # Custom color palette
sorting = "descending", # Sort value in descending order
add = "segments", # Add segments from y = 0 to dots
add.params = list(color = "lightgray", size = 2), # Change segment color and size
group = "cyl", # Order by groups
dot.size = 6, # Large dot size
label = round(dfm$mpg_z,1), # Add mpg values as dot labels
font.label = list(color = "white", size = 9,
vjust = 0.5), # Adjust label parameters
ggtheme = theme_pubr() # ggplot2 theme
)+
geom_hline(yintercept = 0, linetype = 2, color = "lightgray")
```

Color y text by groups. Use y.text.col = TRUE.

```
ggdotchart(dfm, x = "name", y = "mpg",
color = "cyl", # Color by groups
palette = c("#00AFBB", "#E7B800", "#FC4E07"), # Custom color palette
sorting = "descending", # Sort value in descending order
rotate = TRUE, # Rotate vertically
dot.size = 2, # Dot size
y.text.col = TRUE, # Color y text by groups
ggtheme = theme_pubr() # ggplot2 theme
)+
theme_cleveland() # Add dashed grids
```

This analysis has been performed using **R software** (ver. 3.3.2) and **ggpubr** (ver. 0.1.4).


(This article was first published on ** R tutorial for Spatial Statistics**, and kindly contributed to R-bloggers)

As part of my new role as Lecturer in Agri-data analysis at Harper Adams University, I found myself applying a lot of techniques based on linear modelling. Another thing I noticed is that there is a lot of confusion among researchers regarding which technique should be used in each instance and how to interpret the model. For this reason I started reading material from books and online, to try to create a sort of reference tutorial that researchers can use. This post is the result of my work so far.

Please feel free to comment, provide feedback and constructive criticism!!

The classic linear model forms the basis for ANOVA (with categorical treatments) and ANCOVA (which deals with continuous explanatory variables). Its basic equation is the following:

y = β_0 + β_1 x + ε

where β_0 is the intercept (i.e. the value of the line at zero), β_1 is the slope for the variable x, which indicates the change in y as a function of changes in x. For example, if the slope is +0.5, we can say that for each unit increment in x, y increases by 0.5. Please note that the slope can also be negative.

This equation can be expanded to accommodate more than one explanatory variable x:

y = β_0 + β_1 x_1 + β_2 x_2 + … + β_n x_n + ε

In this case the interpretation is a bit more complex, because for example the coefficient β_2 provides the slope for the explanatory variable x_2. This means that for a unit variation of x_2 the target variable y changes by the value of β_2, if the other explanatory variables are kept constant.

In case our model includes interactions, the linear equation would be changed as follows:

y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_1 x_2 + ε

notice the interaction term between x_1 and x_2. In this case the interpretation becomes extremely difficult just by looking at the model.

In fact, if we rewrite the equation focusing for example on x_1:

y = β_0 + (β_1 + β_3 x_2) x_1 + β_2 x_2 + ε

we can see that its slope becomes affected by the value of x_2 (Yan & Su, 2009). For this reason, the only way we can actually determine how x_1 changes y, when the other terms are kept constant, is to use the equation with new values of x_1.
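To make this concrete, here is a small simulated example (the data-generating coefficients below are invented for illustration). With an interaction, the marginal slope of x_1 is β_1 + β_3 x_2, so it must be evaluated at specific values of x_2:

```r
# Toy data generated as y = 1 + 2*x1 - 1*x2 + 0.5*x1*x2 + noise
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 1 + 2*x1 - 1*x2 + 0.5*x1*x2 + rnorm(100, sd = 0.1)

fit <- lm(y ~ x1 * x2) # fits x1, x2 and their interaction x1:x2
cf  <- coef(fit)

# The slope of x1 depends on x2: beta_1 + beta_3 * x2
slope_at_0 <- cf["x1"] + cf["x1:x2"] * 0 # close to 2
slope_at_2 <- cf["x1"] + cf["x1:x2"] * 2 # close to 3 (= 2 + 0.5*2)
```

With these simulated coefficients, a unit increase in x_1 raises y by about 2 when x_2 = 0, but by about 3 when x_2 = 2, which is exactly the ambiguity described above.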

This linear model can be applied to continuous target variables; in this case we would talk about an ANCOVA for exploratory analysis, or a linear regression if the objective is to create a predictive model.

The analysis of variance is based on the linear model presented above; the only difference is that its reference point is the mean of the dataset. When we described the equations above, we said that to interpret the results of the linear model we would look at the slope term, which indicates the rate of change in y if we change one variable and keep the rest constant. The ANOVA instead calculates the effect of each treatment based on the grand mean, which is the mean of the variable of interest.

In mathematical terms ANOVA solves the following equation (Williams, 2004):

y_ij = μ + τ_j + ε_ij

where y_ij is observation i in group j subject to treatment τ_j, while μ is the grand mean (i.e. the mean of the whole dataset). From this equation it is clear that the effects calculated by the ANOVA are not referred to unit changes in the explanatory variables, but are all related to changes relative to the grand mean.

For this example we are going to use one of the datasets available in the package agridat, available on CRAN:

```
install.packages("agridat")
```

We also need to include other packages for the examples below. If some of these are not installed in your system please use again the function install.packages (replacing the name within quotation marks according to your needs) to install them.

```
library(agridat)
library(ggplot2)
library(plotrix)
library(moments)
library(car)
library(fitdistrplus)
library(nlme)
library(multcomp)
library(epade)
library(lme4)
```

Now we can load the dataset lasrosas.corn, which has more than 3400 observations of corn yield in a field in Argentina, plus several explanatory variables, both factorial (or categorical) and continuous.

```
> dat = lasrosas.corn
> str(dat)
'data.frame':   3443 obs. of  9 variables:
 $ year : int  1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...
 $ lat  : num  -33.1 -33.1 -33.1 -33.1 -33.1 ...
 $ long : num  -63.8 -63.8 -63.8 -63.8 -63.8 ...
 $ yield: num  72.1 73.8 77.2 76.3 75.5 ...
 $ nitro: num  132 132 132 132 132 ...
 $ topo : Factor w/ 4 levels "E","HT","LO",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ bv   : num  163 170 168 177 171 ...
 $ rep  : Factor w/ 3 levels "R1","R2","R3": 1 1 1 1 1 1 1 1 1 1 ...
 $ nf   : Factor w/ 6 levels "N0","N1","N2",..: 6 6 6 6 6 6 6 6 6 6 ...
```

Important for the purpose of this tutorial is the target variable yield, which is what we are trying to model, and the explanatory variables: topo (topographic factor), bv (brightness value, which is a proxy for low organic matter content) and nf (factorial nitrogen levels). In addition we have rep, which is the blocking factor.

Since we are planning to use an ANOVA, we first need to check that our data fit its assumptions. ANOVA is based on four assumptions:

- Data independence
- Normality
- Equality of variances between groups
- Balance design (i.e. all groups have the same number of samples)

Let's see how we can test for them in R. Clearly we are talking about environmental data, so the assumption of independence is not met, because data are autocorrelated with distance. Theoretically speaking, for spatial data ANOVA cannot be employed and more robust methods should be used (e.g. REML); however, over the years it has been widely used for the analysis of environmental data and it is accepted by the community. That does not mean that it is the correct method though, and later on in this tutorial we will see the function to perform linear modelling with REML.

The third assumption (equality of variances) is the one that is easiest to assess, using the function tapply:

```
> tapply(dat$yield, INDEX=dat$nf, FUN=var)
      N0       N1       N2       N3       N4       N5 
438.5448 368.8136 372.8698 369.6582 366.5705 405.5653 
```

In this case we used tapply to calculate the variance of yield for each subgroup (i.e. level of nitrogen). There is some variation between groups, but in my opinion it is not substantial. Now we can shift our focus to normality. There are tests to check for normality, but the ANOVA is flexible (particularly when our dataset is big) and can still produce correct results even when its assumptions are violated up to a certain degree. For this reason, it is good practice to check normality with descriptive analysis alone, without any statistical test. For example, we could start by plotting the histogram of yield:

```
hist(dat$yield, main="Histogram of Yield", xlab="Yield (quintals/ha)")
```

By looking at this image it seems that our data are more or less normally distributed. Another plot we could create is the QQplot (http://www.itl.nist.gov/div898/handbook/eda/section3/qqplot.htm):

```
qqnorm(dat$yield, main="QQplot of Yield")
qqline(dat$yield)
```

For normally distributed data the points should all be on the line. This is clearly not the case, but again the deviation is not substantial. The final element we can calculate is the skewness of the distribution, with the function skewness in the package moments:

```
> skewness(dat$yield)
[1] 0.3875977
```

According to Webster and Oliver (2007), if the skewness is below 0.5 we can consider the deviation from normality not big enough to transform the data. Moreover, according to Witte and Witte (2009), if we have more than 10 samples per group we should not worry too much about violating the assumption of normality or equality of variances.

To see how many samples we have for each level of nitrogen we can use once again the function tapply:

` > tapply(dat$yield, INDEX=dat$nf, FUN=length) `

N0 N1 N2 N3 N4 N5

573 577 571 575 572 575

As you can see we have well over 10 samples per group, but our design is not balanced (some groups have more samples than others). This implies that the standard ANOVA should not be used, because the usual way of computing the sums of squares is not appropriate for unbalanced designs (see here for more info: http://goanna.cs.rmit.edu.au/~fscholer/anova.php).

In summary, even though the descriptive analysis suggests that our data are close to normal and have roughly equal variances, our design is unbalanced, so the standard way of computing the ANOVA should not be used. In other words, we should not use the function aov for this dataset. However, since this is a tutorial, we are still going to start by applying the standard ANOVA with aov.

The first thing we need to do is think about the hypothesis we would like to test. For example, we could be interested in looking at nitrogen levels and their impact on yield. Let’s start with some plotting to better understand our data:

` means.nf = tapply(dat$yield, INDEX=dat$nf, FUN=mean) `

` StdErr.nf = tapply(dat$yield, INDEX=dat$nf, FUN=std.error) `

` BP = barplot(means.nf, ylim=c(0,max(means.nf)+10)) `

` segments(BP, means.nf-(2*StdErr.nf), BP, means.nf+(2*StdErr.nf), lwd=1.5) `

` arrows(BP, means.nf-(2*StdErr.nf), BP, means.nf+(2*StdErr.nf), lwd=1.5, angle=90, code=3, length=0.05) `

This code first uses the function tapply to compute the mean and the standard error of the mean (with the function std.error from the package plotrix) for yield in each nitrogen group. It then plots the means as bars and adds error bars of ± twice the standard error (remember that for a normal distribution ± twice the standard error provides an approximate 95% confidence interval around the mean). The result is the following image:

By plotting our data we can start to get an idea of the relation between nitrogen levels and yield. In particular, yield increases with higher levels of nitrogen. However, some of the error bars overlap, which may suggest that the corresponding means are not significantly different. For example, N0 and N1 have error bars that come very close but probably do not overlap, so N1 may well be significantly different from N0; the other levels are all probably significantly different from N0. Among the remaining levels the intervals overlap most of the time, so those differences are probably not significant.
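As an aside, the "± 2 standard errors" rule of thumb used for these error bars comes directly from the quantiles of the normal distribution:

```r
# 97.5% quantile of the standard normal: about 1.96, which is why
# mean +/- 2*SE is a slightly conservative 95% confidence interval
qnorm(0.975)
```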

We can formulate the hypothesis that nitrogen significantly affects yield and that the means of the subgroups are significantly different. Now we just need to test this hypothesis with a one-way ANOVA:

` mod1 = aov(yield ~ nf, data=dat) `

The code above uses the function aov to perform an ANOVA; we can specify to perform a one-way ANOVA simply by including only one factorial term after the tilde (~) sign. We can plot the ANOVA table with the function summary:

` > summary(mod1) `

Df Sum Sq Mean Sq F value Pr(>F)

nf 5 23987 4797 12.4 6.08e-12 ***

Residuals 3437 1330110 387

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

It is clear from this output that nitrogen significantly affects yield, which confirms our first hypothesis. To test the significance of individual levels of nitrogen we can use Tukey’s test:

` > TukeyHSD(mod1, conf.level=0.95) `

Tukey multiple comparisons of means

95% family-wise confidence level

Fit: aov(formula = yield ~ nf, data = dat)

$nf

diff lwr upr p adj

N1-N0 3.6434635 0.3353282 6.951599 0.0210713

N2-N0 4.6774357 1.3606516 7.994220 0.0008383

N3-N0 5.3629638 2.0519632 8.673964 0.0000588

N4-N0 7.5901274 4.2747959 10.905459 0.0000000

N5-N0 7.8588595 4.5478589 11.169860 0.0000000

N2-N1 1.0339723 -2.2770686 4.345013 0.9489077

N3-N1 1.7195004 -1.5857469 5.024748 0.6750283

N4-N1 3.9466640 0.6370782 7.256250 0.0089057

N5-N1 4.2153960 0.9101487 7.520643 0.0038074

N3-N2 0.6855281 -2.6283756 3.999432 0.9917341

N4-N2 2.9126917 -0.4055391 6.230923 0.1234409

N5-N2 3.1814238 -0.1324799 6.495327 0.0683500

N4-N3 2.2271636 -1.0852863 5.539614 0.3916824

N5-N3 2.4958957 -0.8122196 5.804011 0.2613027

N5-N4 0.2687320 -3.0437179 3.581182 0.9999099

There are significant differences between the control and all the other levels of nitrogen, plus differences between N4 and N5 compared to N1, but nothing else. If you look back at the bar chart we produced before, and look carefully at the overlaps between error bars, you will see that, for example, N1, N2 and N3 have overlapping error bars and are thus not significantly different. On the contrary, N1 has no overlap with either N4 or N5, which is exactly what the ANOVA demonstrated.

The function model.tables provides a quick way to print the table of effects and the table of means:

` > model.tables(mod1, type="effects") `

Tables of effects

nf

N0 N1 N2 N3 N4 N5

-4.855 -1.212 -0.178 0.5075 2.735 3.003

rep 573.000 577.000 571.000 575.0000 572.000 575.000

These values all refer to the grand mean, which we can simply calculate with mean(dat$yield) and is equal to 69.83. This means that the mean for N0 is 69.83 - 4.855 = 64.97. We can verify this with another call to the function model.tables, this time with the option type="means":

` > model.tables(mod1, type="means") `

Tables of means

Grand mean

69.82831

nf

N0 N1 N2 N3 N4 N5

64.97 68.62 69.65 70.34 72.56 72.83

rep 573.00 577.00 571.00 575.00 572.00 575.00

The same results can be obtained by fitting a linear model with the function lm; only their interpretation differs. The assumptions for fitting a linear model are again independence (which is always violated with environmental data) and normality.

Let’s look at the code:

` mod2 = lm(yield ~ nf, data=dat) `

This line fits the same model, but as a standard linear equation. This becomes clearer by looking at the summary table:

` > summary(mod2) `

Call:

lm(formula = yield ~ nf, data = dat)

Residuals:

Min 1Q Median 3Q Max

-52.313 -15.344 -3.126 13.563 45.337

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 64.9729 0.8218 79.060 < 2e-16 ***

nfN1 3.6435 1.1602 3.140 0.0017 **

nfN2 4.6774 1.1632 4.021 5.92e-05 ***

nfN3 5.3630 1.1612 4.618 4.01e-06 ***

nfN4 7.5901 1.1627 6.528 7.65e-11 ***

nfN5 7.8589 1.1612 6.768 1.53e-11 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.67 on 3437 degrees of freedom

Multiple R-squared: 0.01771, Adjusted R-squared: 0.01629

F-statistic: 12.4 on 5 and 3437 DF, p-value: 6.075e-12

There are several pieces of information in this table that we should clarify. First of all, it provides some descriptive measures for the residuals, from which we can see that their distribution is relatively normal: the first and third quartiles have similar absolute values, and the same is true for the minimum and maximum.

Then we have the table of coefficients, with the intercept and all the slopes. As you can see the level N0 is not shown in the list; this is the reference level, which means that all the others are expressed relative to it. In other words, the value of the intercept is the mean of nitrogen level N0 (in fact it is the same value we calculated above, 64.97). To calculate the means for the other groups we need to add the corresponding slope to the value of the reference level. For example, the mean for N1 is 64.97 + 3.64 = 68.61 (the same value obtained from the ANOVA). The p-values and significance codes are again relative to the reference level, meaning for example that N1 is significantly different from N0 (the reference level) with a p-value of 0.0017. This is similar to the Tukey’s test we performed above, but it is only valid in relation to N0. To get the same information for the other nitrogen levels we need to change the reference level and fit another model:

` dat$nf = relevel(dat$nf, ref="N1") `

` mod3 = lm(yield ~ nf, data=dat) `

` summary(mod3) `

Now the reference level is N1, so all the results will tell us the effects of nitrogen in relation to N1.

` > summary(mod3) `

Call:

lm(formula = yield ~ nf, data = dat)

Residuals:

Min 1Q Median 3Q Max

-52.313 -15.344 -3.126 13.563 45.337

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 68.616 0.819 83.784 < 2e-16 ***

nfN0 -3.643 1.160 -3.140 0.001702 **

nfN2 1.034 1.161 0.890 0.373308

nfN3 1.720 1.159 1.483 0.138073

nfN4 3.947 1.161 3.400 0.000681 ***

nfN5 4.215 1.159 3.636 0.000280 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.67 on 3437 degrees of freedom

Multiple R-squared: 0.01771, Adjusted R-squared: 0.01629

F-statistic: 12.4 on 5 and 3437 DF, p-value: 6.075e-12

For example, we can see that N0 has a lower mean than N1, and that only N0, N4 and N5 are significantly different from N1, which matches what we saw in the bar chart and found with the Tukey’s test.

Interpreting the output of the function aov is much easier compared to lm. However, in many cases we can only use lm (for example in an ANCOVA, where alongside categorical variables we also have continuous explanatory variables), so it is important that we learn how to interpret its summary table.
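The way group means are recovered from the lm coefficients can be verified on a small simulated dataset (hypothetical data and names, not the tutorial's dat): the intercept equals the mean of the reference level, and adding a slope gives the mean of the corresponding group.

```r
set.seed(42)
d = data.frame(g = factor(rep(c("N0", "N1"), each = 20)))
d$y = rnorm(40, mean = ifelse(d$g == "N0", 65, 69))

m = lm(y ~ g, data = d)
cf = coef(m)

cf["(Intercept)"]              # mean of the reference level N0
cf["(Intercept)"] + cf["gN1"]  # mean of group N1
mean(d$y[d$g == "N0"])         # identical to the intercept
```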

We can obtain the ANOVA table with the function anova:

` > anova(mod2) `

Analysis of Variance Table

Response: yield

Df Sum Sq Mean Sq F value Pr(>F)

nf 5 23987 4797.4 12.396 6.075e-12 ***

Residuals 3437 1330110 387.0

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This uses the type I sum of squares (more info at: http://www.utstat.utoronto.ca/reid/sta442f/2009/typeSS.pdf), which is the standard way but is not appropriate for unbalanced designs. The function Anova in the package car lets us select which type of sum of squares to calculate, and we can specify type=c("III") to correct for the unbalanced design:

` > Anova(mod2, type=c("III")) `

Anova Table (Type III tests)

Response: yield

Sum Sq Df F value Pr(>F)

(Intercept) 2418907 1 6250.447 < 2.2e-16 ***

nf 23987 5 12.396 6.075e-12 ***

Residuals 1330110 3437

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In this example the two results are the same; the large sample size probably helps in this respect. However, for smaller samples this distinction may become important. For this reason, if your design is unbalanced please remember not to use the function aov, but lm and Anova with the option for type III sums of squares.
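To see why this matters, here is a minimal sketch (on simulated, deliberately unbalanced data, not the tutorial's dat) showing that type I sums of squares depend on the order of the terms in the formula:

```r
# Deliberately unbalanced two-factor design: A and B are confounded
d = data.frame(
  A = factor(rep(c("a", "b"), times = c(10, 20))),
  B = factor(c(rep("x", 8), rep("y", 2), rep("x", 5), rep("y", 15)))
)
set.seed(1)
d$y = rnorm(30) + (d$A == "b") + 2 * (d$B == "y")

# With type I (sequential) SS, the sum of squares attributed to A
# changes depending on whether A enters the model first or second
anova(lm(y ~ A + B, data = d))["A", "Sum Sq"]
anova(lm(y ~ B + A, data = d))["A", "Sum Sq"]
```

Type III sums of squares, as computed by car's Anova, adjust each term for all the others and so do not depend on the order of the terms.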

So far we have looked at the effect of nitrogen on yield. However, in the dataset we also have a factorial variable named topo, which stands for topographic factor and has 4 levels: W = West slope, HT = Hilltop, E = East slope, LO = Low East. We already formulated a hypothesis about nitrogen, so now we need to formulate one about topo as well. Once again we can do that using the function tapply and a simple bar chart with error bars. Look at the code below:

` means.topo = tapply(dat$yield, INDEX=dat$topo, FUN=mean) `

` StdErr.topo = tapply(dat$yield, INDEX=dat$topo, FUN=std.error) `

` BP = barplot(means.topo, ylim=c(0,max(means.topo)+10)) `

` segments(BP, means.topo-(2*StdErr.topo), BP, means.topo+(2*StdErr.topo), lwd=1.5) `

` arrows(BP, means.topo-(2*StdErr.topo), BP, means.topo+(2*StdErr.topo), lwd=1.5, angle=90, code=3, length=0.05) `

Here we use exactly the same approach as before when we formulated a hypothesis about nitrogen: we first calculate the mean and standard error of yield for each level of topo, and then plot a bar chart with error bars.

The result is the plot below:

From this plot it is clear that the topographic factor has an effect on yield. In particular, hilltop areas have low yield while the low east corner of the field has high yield. From the error bars we can say with a good level of confidence that all the differences will probably be significant, at least at a significance level of 0.05 (i.e. 95% confidence).

We can test this hypothesis with a two way ANOVA, by simply adding the term topo to the equation:

` mod1b = aov(yield ~ nf + topo, data=dat) `

` summary(mod1b) `

Df Sum Sq Mean Sq F value Pr(>F)

nf 5 23987 4797 23.21 <2e-16 ***

topo 3 620389 206796 1000.59 <2e-16 ***

Residuals 3434 709721 207

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

From the summary table it is clear that both factors have a significant effect on yield, but just by looking at it it is very difficult to identify which levels are the significant ones. To do that we need the Tukey’s test:

` TukeyHSD(mod1b, conf.level=0.95, which=c("topo")) `

Tukey multiple comparisons of means

95% family-wise confidence level

Fit: aov(formula = yield ~ nf + topo, data = dat)

$topo

diff lwr upr p adj

HT-LO -36.240955 -38.052618 -34.429291 0

W-LO -18.168544 -19.857294 -16.479794 0

E-LO -6.206619 -8.054095 -4.359143 0

W-HT 18.072411 16.326440 19.818381 0

E-HT 30.034335 28.134414 31.934257 0

E-W 11.961925 10.178822 13.745028 0

The zero p-values indicate a large significance for each combination, as was already clear from the plot. With the function model.tables you can easily obtain a table of means or effects, if you are interested.

One step further we can take to gain more insight into our data is to add an interaction between nitrogen and topo, and see whether this can further narrow down the main sources of yield variation. Once again we need to start our analysis by formulating a hypothesis. Since we are talking about an interaction, we are now concerned with finding a way to plot yield responses for varying nitrogen levels and topographic positions, so we need a 3d bar chart. We can do that with the function bar3d.ade from the package epade, so please install and load this package.

Then please look at the following R code:

` dat$topo = factor(dat$topo, levels(dat$topo)[c(2,4,1,3)]) `

` means.INT = tapply(dat$yield, INDEX=list(dat$nf, dat$topo), FUN=mean) `

` bar3d.ade(means.INT, col="grey") `

The first line is only used to reorder the levels of the factorial variable topo. This is because from the previous plot we clearly saw that HT has the lowest yield, followed by W, E and LO. We are doing this only to make the 3d bar chart more readable.

The next line applies once again the function tapply, this time to calculate the mean of yield for subgroups divided by nitrogen and topographic factors. The result is a matrix that looks like this:

` > means.INT `

LO HT W E

N0 81.03027 41.50652 62.08192 75.13902

N1 83.06276 48.33630 65.74627 78.12808

N2 85.06879 48.79830 66.70848 78.92632

N3 85.23255 50.18398 66.16531 78.99210

N4 87.14400 52.12039 70.10682 80.39213

N5 87.94122 51.03138 69.65933 80.55078

This can be used directly within the function bar3d.ade to create the 3d bar chart below:

From this plot we can see two things very clearly: first, yield increases from HT to LO along the topographic factor; second, yield again increases from N0 to N5 across the nitrogen levels. These were all expected since we already noticed them before. What we do not see in this plot is any particular influence of the interaction between topography and nitrogen. For example, if you look at HT, you have an increase in yield from N0 to N5 (expected) and overall the yield is lower than for the other bars (again expected). If there were an interaction we would expect this general pattern to change, for example with relatively high yield on the hilltop at high nitrogen levels, or very low yield in the low east side with N0. This does not happen and all the bars follow the expected pattern, so we can hypothesise that the interaction will not be significant.

To test that we need to run another ANOVA with an interaction term:

` mod1c = aov(yield ~ nf * topo, data=dat) `

This formula tests for both main effects and their interaction. To see the significance we can use the summary table:

` > summary(mod1c) `

Df Sum Sq Mean Sq F value Pr(>F)

nf 5 23987 4797 23.176 <2e-16 ***

topo 3 620389 206796 999.025 <2e-16 ***

nf:topo 15 1993 133 0.642 0.842

Residuals 3419 707727 207

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

From this we can conclude that our hypothesis was correct and that the interaction has no effect on yield.

We can get a better idea of the interaction effect by using some functions in the package phia:

` library(phia) `

` plot(interactionMeans(mod1c)) `

This function plots the effects of the interactions in a 2 by 2 plot, including the standard errors of the coefficients, so that we can readily see which overlap:

We already knew from the 3d plot that there is a general increase between N0 and N5 that mainly drives the changes we see in the data. However, from the top-right plot we can see that topo plays a small role between N0 and the others (in fact the black line only slightly overlaps the others), but it has no effect from N1 to N5.

We can look at the numerical break-out of what we see in the plot with another function:

` > testInteractions(mod1c) `

F Test:

P-value adjustment method: holm

Value Df Sum of Sq F Pr(>F)

N0-N1 : HT-W -3.1654 1 377 1.8230 1

N0-N2 : HT-W -2.6652 1 267 1.2879 1

N0-N3 : HT-W -4.5941 1 784 3.7880 1

N0-N4 : HT-W -2.5890 1 250 1.2072 1

N0-N5 : HT-W -1.9475 1 140 0.6767 1

N1-N2 : HT-W 0.5002 1 9 0.0458 1

N1-N3 : HT-W -1.4286 1 76 0.3694 1

N1-N4 : HT-W 0.5765 1 12 0.0604 1

N1-N5 : HT-W 1.2180 1 55 0.2669 1

N2-N3 : HT-W -1.9289 1 139 0.6711 1

N2-N4 : HT-W 0.0762 1 0 0.0011 1

N2-N5 : HT-W 0.7178 1 19 0.0924 1

N3-N4 : HT-W 2.0051 1 149 0.7204 1

The table is very long so only the first lines are included here. From this it is clear that the interaction has no effect (p-values of 1); if it were significant, this function could give us numerous details about the specific effects.

The analysis of covariance (ANCOVA) fits a new model where the effect of the treatments (or factorial variables) is corrected for the effect of continuous covariates, whose effects on yield we can also assess.

The code is very similar to what we saw before, and again we can perform an ANCOVA with the lm function; the only difference is that here we are including an additional continuous explanatory variable in the model:

` > mod3 = lm(yield ~ nf + bv, data=dat) `

` > summary(mod3) `

Call:

lm(formula = yield ~ nf + bv, data = dat)

Residuals:

Min 1Q Median 3Q Max

-78.345 -10.847 -3.314 10.739 56.835

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 271.55084 4.99308 54.385 < 2e-16 ***

nfN0 -3.52312 0.95075 -3.706 0.000214 ***

nfN2 1.54761 0.95167 1.626 0.103996

nfN3 2.08006 0.94996 2.190 0.028619 *

nfN4 3.82330 0.95117 4.020 5.96e-05 ***

nfN5 4.47993 0.94994 4.716 2.50e-06 ***

bv -1.16458 0.02839 -41.015 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.12 on 3436 degrees of freedom

Multiple R-squared: 0.3406, Adjusted R-squared: 0.3394

F-statistic: 295.8 on 6 and 3436 DF, p-value: < 2.2e-16

By printing the summary table we can already see some differences compared to the model with only nitrogen as explanatory variable. The first is related to the adjusted R-squared (which is simply the R-squared corrected for the number of predictors, so that it is less affected by overfitting), which in this case is around 0.34. If we look back at the summary table of the model with only nitrogen, the adjusted R-squared was only 0.02. This means that by adding the continuous variable bv we massively increase the explanatory power of the model; in fact, this new model is capable of explaining around 33% of the variation in yield. Moreover, other terms become significant, for example N3. This is because the inclusion of bv changes the entire model and its interpretation becomes less obvious compared to the simple bar chart we plotted at the beginning.

The interpretation of the ANCOVA model is more complex than that of the one-way ANOVA. In fact, the intercept value has changed and is no longer the mean of the reference level N1. This is because the model now changes based on the covariate bv. The slopes can be used to assess the relative impact of each term; for example, N0 has a negative impact on yield relative to its reference level. Therefore, shifting from nitrogen level N1 to N0 decreases the yield by 3.52, if bv is kept constant.

Take a look at the following code:

` > test.set = data.frame(nf="N1", bv=150) `

` > predict(mod3, test.set) `

       1 
96.86350 

` > test.set = data.frame(nf="N0", bv=150) `

` > predict(mod3, test.set) `

       1 
93.34037

Here we are using the model (mod3) to estimate new values of yield for given parameters. In the first example we set nf to N1 (the reference level) and keep bv constant at 150. With the function predict we can estimate these new values using mod3: for N1 and bv equal to 150 the predicted yield is 96.86.

In the second example we did the same but for nitrogen level N0. The result is a yield of 93.34, a difference of exactly 3.52, which is the coefficient of N0 in the model.
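This equivalence between the prediction difference and the coefficient can be checked on a small simulated ANCOVA (hypothetical data and names, not the tutorial's dat):

```r
set.seed(1)
d = data.frame(g = factor(rep(c("N0", "N1"), each = 25)),
               bv = runif(50, 100, 200))
d$y = 70 - 3.5 * (d$g == "N0") - 0.2 * d$bv + rnorm(50, sd = 2)
d$g = relevel(d$g, ref = "N1")   # mirror the tutorial: N1 as reference

m = lm(y ~ g + bv, data = d)
p1 = predict(m, data.frame(g = "N1", bv = 150))
p0 = predict(m, data.frame(g = "N0", bv = 150))

p0 - p1            # exactly the coefficient of gN0
coef(m)["gN0"]
```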

For computing the ANOVA table, we can again use either the function anova (if the design is balanced) or Anova with type III (for unbalanced designs).

Let’s now look at another example with a slightly more complex model where we include two factorial variables and one continuous variable: we add the variable topo to the model. We can check its levels with the function levels:

` > levels(dat$topo) `

[1] "E" "HT" "LO" "W"

Please notice that E is the first and therefore is the reference level for this factor. Now let’s fit the model and look at the summary table:

` > mod4 = lm(yield ~ nf + bv + topo, data=dat) `

` > summary(mod4) `

Call:

lm(formula = yield ~ nf + bv + topo, data = dat)

Residuals:

Min 1Q Median 3Q Max

-46.394 -10.927 -2.211 10.364 47.338

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 171.20921 5.28842 32.374 < 2e-16 ***

nfN0 -3.73225 0.81124 -4.601 4.36e-06 ***

nfN2 1.29704 0.81203 1.597 0.1103

nfN3 1.56499 0.81076 1.930 0.0537 .

nfN4 3.71277 0.81161 4.575 4.94e-06 ***

nfN5 3.88382 0.81091 4.789 1.74e-06 ***

bv -0.54206 0.03038 -17.845 < 2e-16 ***

topoHT -24.11882 0.78112 -30.877 < 2e-16 ***

topoLO 3.13643 0.70924 4.422 1.01e-05 ***

topoW -10.77758 0.66708 -16.156 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.75 on 3433 degrees of freedom

Multiple R-squared: 0.5204, Adjusted R-squared: 0.5191

F-statistic: 413.8 on 9 and 3433 DF, p-value: < 2.2e-16

The adjusted R-squared increases again and now we are able to explain around 52% of the variation in yield.

The interpretation is similar to what we said before; the only difference is that here both factors have a reference level. So for example, the effect of topoHT is relative to the reference level, which is the one not shown (E). If we change the topographic position from E to HT, while keeping everything else in the model constant (same value of bv and same nitrogen level), yield decreases by 24.12.

Another thing we can see from this table is that the p-values change, and for example N3 becomes less significant. This is probably because when we consider more variables the effect of N3 on yield is partly explained by other variables, maybe partly bv and partly topo.

One last thing we can check, and this is something we should check every time we perform an ANOVA or fit a linear model, is the normality of the residuals. We already saw that the summary table provides some data about the residual distribution (minimum, first quartile, median, third quartile and maximum) that gives a good indication of normality, since the distribution appears centred around 0. However, we can also use other tools to check this, for example a QQ plot:

` RES = residuals(mod4) `

` qqnorm(RES) `

` qqline(RES) `

The function residuals automatically extracts the residuals from the model, which can then be used to create the following plot:

It looks approximately normal; for further confirmation we can again use the function skewness (e.g. ` skewness(RES) `), which returns a value below 0.5, so we can consider this distribution normal.

Let’s now add a further layer of complexity by adding an interaction term between bv and topo. Once again we need to formulate a hypothesis before proceeding to test it. Since we are interested in the interaction between a continuous variable (bv) and a factorial variable (topo) on yield, we could create scatterplots of yield versus bv for the different levels of topo. We can easily do that with the package ggplot2:

` qplot(yield, bv, data=dat, geom="point", xlab="Yield", ylab="bv") + facet_wrap(~topo) + geom_smooth(method="lm", se=TRUE) `

Explaining every bit of the code above would take some time and is beyond the scope of this tutorial. In essence, it creates a scatterplot of yield versus bv for each subgroup of topo and then fits a linear regression line through the points. For more info about ggplot2 please start by looking here: http://www.statmethods.net/advgraphs/ggplot2.html

This creates the plot below:

From this plot it is clear that the four lines have different slopes, so the interaction between bv and topo may well be significant and help us further increase the explanatory power of our model. We can test that by adding this interaction:

` mod5 = lm(yield ~ nf + bv * topo, data=dat) `

We can use the function Anova to check the significance:

` > Anova(mod5, type=c("III")) `

Anova Table (Type III tests)

Response: yield

Sum Sq Df F value Pr(>F)

(Intercept) 20607 1 115.225 < 2.2e-16 ***

nf 23032 5 25.758 < 2.2e-16 ***

bv 5887 1 32.920 1.044e-08 ***

topo 40610 3 75.691 < 2.2e-16 ***

bv:topo 36059 3 67.209 < 2.2e-16 ***

Residuals 613419 3430

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As you can see this interaction is significant. To check the details we can look at the summary table:

` > summary(mod5) `

Call:

lm(formula = yield ~ nf + bv * topo, data = dat)

Residuals:

Min 1Q Median 3Q Max

-46.056 -10.328 -1.516 9.622 50.184

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 93.45783 8.70646 10.734 < 2e-16 ***

nfN1 3.96637 0.78898 5.027 5.23e-07 ***

nfN2 5.24313 0.79103 6.628 3.93e-11 ***

nfN3 5.46036 0.79001 6.912 5.68e-12 ***

nfN4 7.52685 0.79048 9.522 < 2e-16 ***

nfN5 7.73646 0.79003 9.793 < 2e-16 ***

bv -0.27108 0.04725 -5.738 1.04e-08 ***

topoW 88.11105 12.07428 7.297 3.63e-13 ***

topoE 236.12311 17.12941 13.785 < 2e-16 ***

topoLO -15.76280 17.27191 -0.913 0.3615

bv:topoW -0.41393 0.06726 -6.154 8.41e-10 ***

bv:topoE -1.21024 0.09761 -12.399 < 2e-16 ***

bv:topoLO 0.28445 0.10104 2.815 0.0049 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.37 on 3430 degrees of freedom

Multiple R-squared: 0.547, Adjusted R-squared: 0.5454

F-statistic: 345.1 on 12 and 3430 DF, p-value: < 2.2e-16

The R-squared is a bit higher, which means we can explain more of the variability in yield by adding the interaction. For the interpretation, once again everything is relative to the reference levels of the factors, including the interaction. So for example, bv:topoW tells us that, moving from the reference level HT to W, the slope of bv becomes more negative (by 0.41), keeping everything else constant.

For information about individual changes we would need to use the model to estimate new data as we did for mod3.

As mentioned, there are certain assumptions we need to check before starting an analysis with linear models. Assumptions about normality and equality of variances can be relaxed, particularly with large sample sizes. However, other assumptions, for example balance in the design and independence, tend to be stricter, and we need to be careful about violating them.

We can check for independence by looking at the correlation among the coefficients, directly in the summary table:

` > summary(mod5, cor=T) `

[…]

Correlation of Coefficients:

(Intercept) nfN1 nfN2 nfN3 nfN4 nfN5 bv topoW topoE topoLO

nfN1 -0.05

nfN2 -0.04 0.50

nfN3 -0.05 0.50 0.50

nfN4 -0.05 0.50 0.50 0.50

nfN5 -0.05 0.50 0.50 0.50 0.50

bv -1.00 0.01 -0.01 0.01 0.00 0.00

topoW -0.72 0.00 0.00 0.00 0.00 0.00 0.72

topoE -0.51 0.01 0.02 0.03 0.01 0.02 0.51 0.37

topoLO -0.50 -0.02 -0.01 0.02 -0.01 0.02 0.50 0.36 0.26

bv:topoW 0.70 0.00 0.00 0.00 0.00 0.00 -0.70 -1.00 -0.36 -0.35

bv:topoE 0.48 -0.01 -0.02 -0.03 -0.01 -0.02 -0.48 -0.35 -1.00 -0.24

bv:topoLO 0.47 0.02 0.01 -0.02 0.01 -0.02 -0.47 -0.34 -0.24 -1.00

bv:topoW bv:topoE

nfN1

nfN2

nfN3

nfN4

nfN5

bv

topoW

topoE

topoLO

bv:topoW

bv:topoE 0.34

bv:topoLO 0.33 0.23

If we exclude the interaction terms, which are obviously correlated with the individual covariates, the rest of the coefficients are not highly correlated. From this we may conclude that our assumption of independence holds for this dataset. In cases where this table shows relatively high correlations among the coefficients, we would need a more robust method for computing them, based on maximum likelihood (ML) or restricted maximum likelihood (REML). This can be done with the function gls in the package nlme, using the same syntax as for lm:

` mod6 = gls(yield ~ nf + bv * topo, data=dat, method="REML") `

` Anova(mod6, type=c("III")) `

` summary(mod6) `

As you can see despite the different function (gls instead of lm), the rest of the syntax is the same as before. We can still use the function Anova to print the ANOVA table and summary to check the individual coefficients.

Moreover, we can also use the function anova to compare the two models (the one from gls and the one from lm) and see which performs best:

` > anova(mod6, mod5) `

Model df AIC BIC logLik

mod6 1 14 27651.21 27737.18 -13811.61

mod5 2 14 27651.21 27737.18 -13811.61

The indexes AIC, BIC and logLik are all used to assess the fit of the model; AIC and BIC should be as low as possible (while the log-likelihood should be as high as possible). For more info please look at the appendix about assessing the accuracy of our model.
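These criteria can also be extracted individually from any fitted model with the base functions AIC, BIC and logLik (a sketch on simulated data, not the tutorial's dat):

```r
set.seed(1)
d = data.frame(x = runif(30))
d$y = 2 * d$x + rnorm(30)
m = lm(y ~ x, data = d)

AIC(m)      # lower is better
BIC(m)      # lower is better; penalizes extra parameters more heavily
logLik(m)   # higher is better
```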

This class of models is used to account for more than one source of random variation. For example, assume we have a dataset where we are again trying to model yield as a function of nitrogen level, but this time the data were collected in many different farms. In this case each farm would need to be considered a cluster, and the model would need to take this clustering into account. Another common setting where linear mixed-effects models are used is repeated measures, where time provides an additional source of correlation between measurements. For these models we do not need to worry as much about the assumptions discussed above, since mixed-effects models are quite robust against violating them; for example, for unbalanced designs with blocking these methods should probably be used instead of the standard ANOVA.

At the beginning of this tutorial we explored the equation that underlies the linear model:

y = β0 + β1x + ε

This equation can be seen as split into two components: the fixed and the random effects. By fixed effects we mean the variables we are using to explain the model. These may be factorial (in ANOVA), continuous, or a mix of the two (ANCOVA), and they can also include the blocks used in our design. The other component in the equation is the random effect (the error term ε), which represents a level of uncertainty that is difficult to account for in the model. For example, when we work with yield we might see differences between plants grown in similar soils and conditions. These may be related to the seeds or to other factors, and are part of the within-subject variation that we cannot explain.

There are times however where in the data there are multiple sources of random variation. For example, data may be clustered in separate field or separate farms. This will provide an additional source of random variation that needs to be taken into account in the model. To do so the standard equation can be amended in the following way:

This is referred to as a random intercept model, where the random variation is split into a cluster specific variation *u* and the standard error term. A more complex form, that is normally used for repeated measures is the random slope and intercept model:

Where we add a new source of random variation *v* related to time *T*.
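Before moving to the real dataset, the random intercept idea can be made concrete with a small simulation (all names here are made up for the example): we generate clustered data with a farm-specific random intercept and recover the fixed effects with lme from nlme.

```r
library(nlme)

set.seed(42)
# Simulate 5 clusters (e.g. farms) of 30 observations each
farm <- factor(rep(1:5, each = 30))
u <- rnorm(5, sd = 3)            # cluster-specific random intercepts
x <- runif(150, 0, 10)           # a continuous predictor
y <- 50 + 2 * x + u[farm] + rnorm(150, sd = 1)
sim <- data.frame(y, x, farm)

# Random intercept model: fixed effect for x, random intercept per farm
m <- lme(y ~ x, random = ~1 | farm, data = sim)
fixef(m)  # estimates close to the true values 50 and 2
```

The variance of the random intercepts (here simulated with standard deviation 3) shows up in the `Random effects` section of `summary(m)`, separately from the residual standard deviation.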

Just to explain the syntax to use linear mixed-effects model in R for cluster data, we will assume that the factorial variable rep in our dataset describe some clusters in the data. To fit a mixed-effects model we are going to use the function lme from the package nlme. This function can work with unbalanced designs:

` lme1 = lme(yield ~ nf + bv * topo, random= ~1|rep, data=dat) `

The syntax is very similar to all the models we fitted before, with a general formula describing our target variable yield and all the treatments, which are the fixed effects of the model. Then we have the option random, which allows us to include an additional random component for the clustering factor rep. In this case the ~1 indicates that the random effect will be associated with the intercept.

Once again we can use the function summary to explore our results:

```r
> summary(lme1)
Linear mixed-effects model fit by REML
 Data: dat 
       AIC      BIC    logLik
  27648.36 27740.46 -13809.18

Random effects:
 Formula: ~1 | rep
        (Intercept) Residual
StdDev:    0.798407  13.3573

Fixed effects: yield ~ nf + bv * topo 
                Value Std.Error   DF    t-value p-value
(Intercept)  327.3304 14.782524 3428  22.143068       0
nfN1           3.9643  0.788049 3428   5.030561       0
nfN2           5.2340  0.790104 3428   6.624449       0
nfN3           5.4498  0.789084 3428   6.906496       0
nfN4           7.5286  0.789551 3428   9.535320       0
nfN5           7.7254  0.789111 3428   9.789976       0
bv            -1.4685  0.085507 3428 -17.173569       0
topoHT      -233.3675 17.143956 3428 -13.612232       0
topoLO      -251.9750 20.967003 3428 -12.017693       0
topoW       -146.4066 16.968453 3428  -8.628162       0
bv:topoHT      1.1945  0.097696 3428  12.226279       0
bv:topoLO      1.4961  0.123424 3428  12.121624       0
bv:topoW       0.7873  0.097865 3428   8.044485       0
```

We can also use the function Anova (from the package car) to display the ANOVA table:

```r
> Anova(lme2, type=c("III"))
Analysis of Deviance Table (Type III tests)

Response: yield
             Chisq Df Pr(>Chisq)    
(Intercept) 752.25  1  < 2.2e-16 ***
nf          155.57  5  < 2.2e-16 ***
bv          291.49  1  < 2.2e-16 ***
topo        236.52  3  < 2.2e-16 ***
year        797.13  1  < 2.2e-16 ***
bv:topo     210.38  3  < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

We might be interested in understanding if fitting a more complex model provides any advantage in terms of accuracy, compared with a model where no additional random effect is included. To do so we can compare this new model with mod6, which we created earlier with the gls function and includes the same treatment structure. We can do that with the function anova, specifying both models:

```r
> anova(lme1, mod6)
     Model df      AIC      BIC    logLik   Test  L.Ratio p-value
lme1     1 15 27648.36 27740.46 -13809.18                        
mod6     2 14 27651.21 27737.18 -13811.61 1 vs 2 4.857329  0.0275
```

As you can see there is a decrease in AIC for the model fitted with lme, and the difference is significant (p-value below 0.05). Therefore this new model, where clustering is accounted for, is slightly better than the one without the additional random effect. In this case we would need to decide whether fitting a more complex model (which is probably more difficult to explain to readers) is worth the small gain.

If we collected data at several time steps we are looking at a repeated measures analysis. The code to create such a model is the following:

```r
lme2 = lme(yield ~ nf + bv * topo + year, random= ~year|rep, data=dat)
summary(lme2)
Anova(lme2, type=c("III"))
```

The syntax is very similar to what we wrote before, except that now the random component includes both time and clusters. Again we can use summary and Anova to get more info about the model. We can also use the function anova again to compare this with the previous model:

```r
> anova(lme1, lme2)
     Model df      AIC      BIC    logLik   Test  L.Ratio p-value
lme1     1 15 27648.36 27740.46 -13809.18                        
lme2     2 18 26938.83 27049.35 -13451.42 1 vs 2 715.5247  <.0001
Warning message:
In anova.lme(lme1, lme2) :
  fitted objects with different fixed effects. REML comparisons are not meaningful.
```

From this output it is clear that the new model is better than the one before, and their difference is highly significant. Note however the warning: the two models have different fixed effects, so a comparison based on REML likelihoods is not meaningful, and we should refit both models with maximum likelihood before trusting this test.
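The warning above matters: REML likelihoods are only comparable between models sharing the same fixed effects. A self-contained sketch of the safer procedure (using simulated data with made-up names, since the real dataset is not reproduced here) is to refit both models with `method = "ML"` before calling anova:

```r
library(nlme)

set.seed(1)
# Toy data: a yield-like response with a cluster factor and a year covariate
d <- data.frame(rep = factor(rep(1:4, each = 25)),
                year = rep(1:5, 20))
d$yield <- 10 + 0.5 * d$year + rnorm(4, sd = 2)[d$rep] + rnorm(100)

# Fit with maximum likelihood so the likelihood-ratio test
# across different fixed effects is meaningful
m1 <- lme(yield ~ 1,    random = ~1 | rep, data = d, method = "ML")
m2 <- lme(yield ~ year, random = ~1 | rep, data = d, method = "ML")
anova(m1, m2)  # valid comparison, no REML warning
```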

We can extract only the effects for the random components using the function ranef:

```r
> ranef(lme2)
   (Intercept)          year
R1  -0.3468601 -1.189799e-07
R2  -0.5681688 -1.973702e-07
R3   0.9150289  3.163501e-07
```

This tells us the changes in yield for each cluster and time step. We can also do the same for the fixed effects, and this will return the coefficients of the model:

```r
> fixef(lme2)
  (Intercept)          nfN1          nfN2          nfN3          nfN4 
-1.133614e+04  3.918006e+00  5.132136e+00  5.368513e+00  7.464542e+00 
         nfN5            bv        topoHT        topoLO         topoW 
 7.639337e+00 -1.318391e+00 -2.049979e+02 -2.321431e+02 -1.136168e+02 
         year     bv:topoHT     bv:topoLO      bv:topoW 
 5.818826e+00  1.027686e+00  1.383705e+00  5.998379e-01 
```

To have an idea of their confidence interval we can use the function intervals:

```r
> intervals(lme2, which = "fixed")
Approximate 95% confidence intervals

 Fixed effects:
                    lower          est.         upper
(Intercept) -1.214651e+04 -1.133614e+04 -1.052576e+04
nfN1         2.526139e+00  3.918006e+00  5.309873e+00
nfN2         3.736625e+00  5.132136e+00  6.527648e+00
nfN3         3.974809e+00  5.368513e+00  6.762216e+00
nfN4         6.070018e+00  7.464542e+00  8.859065e+00
nfN5         6.245584e+00  7.639337e+00  9.033089e+00
bv          -1.469793e+00 -1.318391e+00 -1.166989e+00
topoHT      -2.353450e+02 -2.049979e+02 -1.746508e+02
topoLO      -2.692026e+02 -2.321431e+02 -1.950836e+02
topoW       -1.436741e+02 -1.136168e+02 -8.355954e+01
year         5.414742e+00  5.818826e+00  6.222911e+00
bv:topoHT    8.547273e-01  1.027686e+00  1.200644e+00
bv:topoLO    1.165563e+00  1.383705e+00  1.601846e+00
bv:topoW     4.264933e-01  5.998379e-01  7.731826e-01
attr(,"label")
[1] "Fixed effects:"
```

As you remember, when we first introduced the simple linear model we defined a set of assumptions that need to be met to apply it. We already talked about methods to deal with deviations from the assumptions of independence, equality of variances and balanced designs, and about the fact that, particularly if our dataset is large, we may reach robust results even if our data are not perfectly normal. However, there are datasets for which the target variable has a completely different distribution from the normal; in these cases we need to change our modelling method and employ generalized linear models (GLM).

Common scenarios where this class of models should be considered are, for example, studies where the variable of interest is binary (such as presence or absence of a species), or where we are interested in modelling counts (such as the number of insects present in a particular location). In these cases, where the target variable is not continuous but rather discrete or categorical, the assumption of normality is usually not met. In this section we will focus on the two scenarios just mentioned, but GLM can be used to deal with data distributed in many different ways, and we will also introduce how to deal with more general cases.

Data of this type, i.e. counts or rates, are characterized by the fact that their lower bound is always zero. This does not fit well with a normal linear model, where the regression line may well estimate negative values. For this type of variable we can employ a Poisson Regression, which fits the following model:

As you can see the equation is very similar to the standard linear model; the difference is that, to ensure the fitted values are always positive (since we cannot have negative counts), we are modelling the log of the expected value of *Y*.
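The log link can be made concrete with a tiny simulation (all names here are illustrative, not from the dataset used below): we generate counts whose log-mean depends linearly on x, fit a Poisson GLM, and recover the coefficients on the log scale.

```r
set.seed(7)
x <- runif(500, 0, 2)
# True model: log(E[y]) = 0.5 + 1.2 * x
y <- rpois(500, lambda = exp(0.5 + 1.2 * x))

m <- glm(y ~ x, family = poisson)
coef(m)       # close to 0.5 and 1.2, on the log scale
exp(coef(m))  # multiplicative effects on the expected count
```

Exponentiating the coefficients is the usual way to read a Poisson model: each unit increase in x multiplies the expected count by exp(coefficient).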

In R fitting this model is very easy. For this example we are going to use another dataset available in the package agridat, named beall.webworms, which represents counts of webworms in a beet field, with insecticide treatments:

```r
> dat = beall.webworms
> str(dat)
'data.frame': 1300 obs. of 7 variables:
 $ row  : int 1 2 3 4 5 6 7 8 9 10 ...
 $ col  : int 1 1 1 1 1 1 1 1 1 1 ...
 $ y    : int 1 0 1 3 6 0 2 2 1 3 ...
 $ block: Factor w/ 13 levels "B1","B10","B11",..: 1 1 1 1 1 6 6 6 6 6 ...
 $ trt  : Factor w/ 4 levels "T1","T2","T3",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ spray: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
 $ lead : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
```

We can check the distribution of our data with the function hist:

` hist(dat$y, main="Histogram of Worm Count", xlab="Number of Worms") `

We are going to fit a simple model first to see how to interpret its results, and then compare it with a more complex model:

` pois.mod = glm(y ~ trt, data=dat, family=c("poisson")) `

Once again the function summary will show some useful details about this model:

```r
> summary(pois.mod)
Call:
glm(formula = y ~ trt, family = c("poisson"), data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6733  -1.0046  -0.9081   0.6141   4.2771  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.33647    0.04688   7.177 7.12e-13 ***
trtT2       -1.02043    0.09108 -11.204  < 2e-16 ***
trtT3       -0.49628    0.07621  -6.512 7.41e-11 ***
trtT4       -1.22246    0.09829 -12.438  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1955.9 on 1299 degrees of freedom
Residual deviance: 1720.4 on 1296 degrees of freedom
AIC: 3125.5

Number of Fisher Scoring iterations: 6
```

The first valuable information is related to the residuals of the model, which should be symmetrical as for any normal linear model. From this output we can see that minimum and maximum, as well as the first and third quartiles, are similar, so this assumption is reasonably met. Then we can see that the variable trt (i.e. the treatment factor) is highly significant for the model, with very low p-values. Its effects are all negative and expressed relative to the first level T1, meaning for example that a change from T1 to T2 will decrease the log of the expected count by 1.02 (i.e. multiply the expected count by exp(−1.02) ≈ 0.36). We can check this effect by estimating the difference between T1 and T2 with the function predict and the option newdata (note that by default the predictions are on the log scale):

```r
> predict(pois.mod, newdata=data.frame(trt=c("T1","T2")))
         1          2 
 0.3364722 -0.6839588 
```
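Since these values are on the link (log) scale, exponentiating them, or calling predict with type="response", gives the expected counts themselves. A self-contained sketch with simulated data (made-up names, since the worm data live in the agridat package):

```r
set.seed(3)
d <- data.frame(trt = factor(rep(c("T1", "T2"), each = 50)))
d$y <- rpois(100, lambda = ifelse(d$trt == "T1", 4, 2))

m <- glm(y ~ trt, data = d, family = poisson)

# Default predictions are on the log (link) scale
link_scale <- predict(m, newdata = data.frame(trt = c("T1", "T2")))
# type="response" back-transforms to expected counts
resp_scale <- predict(m, newdata = data.frame(trt = c("T1", "T2")),
                      type = "response")
all.equal(unname(exp(link_scale)), unname(resp_scale))  # TRUE
```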

Another important piece of information is the null and residual deviances. The residual deviance compares the fitted model with the saturated model (one that fits the data perfectly) and can be used as a goodness-of-fit test; we can compute its p-value with the following line:

```r
> 1-pchisq(deviance(pois.mod), df.residual(pois.mod))
[1] 1.709743e-14
```

This p-value is very low, which means we reject the hypothesis that the model fits the data as well as the saturated model: there is significant lack of fit, so there is room for improvement. We can use the AIC parameter to compare it with other models. For example, we could include more variables:

` pois.mod2 = glm(y ~ block + spray*lead, data=dat, family=c("poisson")) `

How does this new model compare with the previous?

```r
> AIC(pois.mod, pois.mod2)
          df      AIC
pois.mod   4 3125.478
pois.mod2 16 3027.438
```

As you can see the second model has a lower AIC, meaning that it fits the data better than the first.

One of the assumptions of the Poisson distribution is that its mean and variance are equal. We can check this by simply comparing the mean and variance of our data:

```r
> mean(dat$y)
[1] 0.7923077
> var(dat$y)
[1] 1.290164
```

In cases such as this, when the variance is larger than the mean (we then talk about overdispersed count data), we should employ different methods, for example a quasi-Poisson model:

` pois.mod3 = glm(y ~ trt, data=dat, family=c("quasipoisson")) `

The summary function provides us with the dispersion parameter, which for a Poisson distribution should be 1:

```r
> summary(pois.mod3)
Call:
glm(formula = y ~ trt, family = c("quasipoisson"), data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6733  -1.0046  -0.9081   0.6141   4.2771  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.33647    0.05457   6.166 9.32e-10 ***
trtT2       -1.02043    0.10601  -9.626  < 2e-16 ***
trtT3       -0.49628    0.08870  -5.595 2.69e-08 ***
trtT4       -1.22246    0.11440 -10.686  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 1.35472)

    Null deviance: 1955.9 on 1299 degrees of freedom
Residual deviance: 1720.4 on 1296 degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 6
```

Since the dispersion parameter is 1.35, we can conclude that our data are not terribly overdispersed, so a Poisson regression would probably still be appropriate for this dataset.
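The dispersion parameter reported by summary can also be computed by hand: it is the sum of squared Pearson residuals divided by the residual degrees of freedom, and should be close to 1 for genuinely Poisson data. A self-contained sketch with simulated (non-overdispersed) counts:

```r
set.seed(11)
x <- runif(300)
y <- rpois(300, exp(1 + x))  # true Poisson data, no overdispersion

m <- glm(y ~ x, family = quasipoisson)

# Dispersion = Pearson chi-square / residual df
disp <- sum(residuals(m, type = "pearson")^2) / df.residual(m)
disp                   # close to 1 here
summary(m)$dispersion  # same value, as reported by summary
```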

However, there are cases where the data are strongly overdispersed, with a variance far larger than the mean. In those cases we need to employ the negative binomial regression, with the function glm.nb available in the package MASS:

```r
library(MASS)
NB.mod1 = glm.nb(y ~ trt, data=dat)
```
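A quick self-contained demonstration of why the negative binomial helps (simulated data, not the worm counts): we generate overdispersed counts and compare the Poisson and negative binomial fits by AIC.

```r
library(MASS)

set.seed(5)
# Negative binomial counts: variance = mu + mu^2/theta, so well above the mean
y <- rnbinom(400, mu = 4, size = 1.5)

pois <- glm(y ~ 1, family = poisson)
nb   <- glm.nb(y ~ 1)

AIC(pois, nb)  # the negative binomial wins by a wide margin
```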

Another popular form of regression that can be tackled with GLM is the logistic regression, where the variable of interest is binary (0 and 1, presence and absence, or any other binary outcome). In this case the regression model takes the following equation:

Again, the equation is very similar to the standard linear model, but what we are computing from this model is the log odds (the logit) of the probability that one of the two outcomes will occur.
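In R the logit and its inverse are available directly as qlogis and plogis, which makes the link function easy to explore before fitting anything:

```r
p <- 0.8
qlogis(p)          # log odds: log(0.8 / 0.2) ≈ 1.386
plogis(qlogis(p))  # back-transforming recovers the probability: 0.8
```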

For this example we are going to use another dataset available in the package agridat called johnson.blight, where the binary variable of interest is the presence or absence of blight (either 0 or 1) in potatoes:

```r
> dat = johnson.blight
> str(dat)
'data.frame': 25 obs. of 6 variables:
 $ year    : int 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ...
 $ area    : int 0 0 0 0 50 810 120 40 0 0 ...
 $ blight  : int 0 0 0 0 1 1 1 1 0 0 ...
 $ rain.am : int 8 9 9 6 16 10 12 10 11 8 ...
 $ rain.ja : int 1 4 6 1 6 7 12 4 10 9 ...
 $ precip.m: num 5.84 6.86 47.29 8.89 7.37 ...
```

In R fitting this model is very easy:

` mod9 = glm(blight ~ rain.am, data=dat, family=binomial) `

We are now using the binomial family for a logistic regression. To check the model we can rely again on summary:

```r
> summary(mod9)
Call:
glm(formula = blight ~ rain.am, family = binomial, data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9395  -0.6605  -0.3517   1.0228   1.6048  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -4.9854     2.0720  -2.406   0.0161 *
rain.am       0.4467     0.1860   2.402   0.0163 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.617 on 24 degrees of freedom
Residual deviance: 24.782 on 23 degrees of freedom
AIC: 28.782

Number of Fisher Scoring iterations: 5
```

This table is very similar to the one created for count data, so a lot of the discussion above can be used here. The main difference is in the way we can interpret the coefficients, because we need to remember that here we are calculating the logit function of the probability, so 0.4467 (coefficient for rain.am) is not the actual probability associated with an increase in rain. However, what we can say by just looking at the coefficients is that rain has a positive effect on blight, meaning that more rain increases the chances of finding blight in potatoes.

To estimate probabilities we need to use the function predict:

```r
> predict(mod9, type="response")
         1          2          3          4          5          6          7 
0.19598032 0.27590141 0.27590141 0.09070472 0.89680283 0.37328295 0.59273722 
         8          9         10         11         12         13         14 
0.37328295 0.48214935 0.19598032 0.69466455 0.19598032 0.84754431 0.27590141 
        15         16         17         18         19         20         21 
0.93143346 0.05998586 0.19598032 0.05998586 0.84754431 0.59273722 0.59273722 
        22         23         24         25 
0.48214935 0.59273722 0.98109229 0.89680283 
```

This calculates the probability associated with the values of rain in the dataset. To know the probability associated with new values of rain we can again use predict with the option newdata:

```r
> predict(mod9, newdata=data.frame(rain.am=c(15)), type="response")
        1 
0.8475443 
```

This tells us that when rain is equal to 15 we have an 84% chance of finding blight (i.e. of finding a 1) in potatoes.
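We can verify this number by hand from the coefficients reported in summary(mod9): the linear predictor at rain.am = 15 is the intercept plus the slope times 15, and plogis (the inverse logit) converts it to a probability:

```r
eta <- -4.9854 + 0.4467 * 15  # linear predictor (log odds)
plogis(eta)                   # about 0.847, matching predict()
```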

To assess the accuracy of the model we can use two approaches; the first is based on the deviances listed in the summary. The residual deviance compares this model with the saturated model, which fits the data perfectly. So we calculate the following p-value (using the residual deviance and its degrees of freedom from the summary table):

```r
> 1-pchisq(24.782, 23)
[1] 0.3616226
```

Because it is higher than 0.05, we cannot reject the hypothesis that this model fits the data as well as the saturated model; therefore our model seems to be good.

We can repeat the same procedure for the null deviance, which tests whether the intercept-only model (with no variables) already fits the data:

```r
> 1-pchisq(34.617, 24)
[1] 0.07428544
```

Since this is also not significant, the intercept-only model is not clearly inadequate either, which (in contrast to the previous result) suggests our predictor may not be adding a great deal.
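A more direct test of whether rain.am improves on the intercept-only model is the drop in deviance between the null and residual deviances, compared against a chi-squared distribution with the difference in degrees of freedom:

```r
# Likelihood-ratio test: intercept-only vs fitted model, using the
# deviances and degrees of freedom reported in summary(mod9)
1 - pchisq(34.617 - 24.782, df = 24 - 23)  # ≈ 0.0017, so rain.am helps
```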

An additional, and probably easier to understand, way to assess the accuracy of a logistic model is calculating the pseudo R-squared, which can be done after installing the package pscl:

```r
install.packages("pscl")
library(pscl)
```

Now we can run the following function:

```r
> pR2(mod9)
        llh     llhNull          G2    McFadden        r2ML        r2CU 
-12.3910108 -17.3086742   9.8353268   0.2841155   0.3252500   0.4338984 
```

From this we can see that our model explains around 30-40% of the variation in blight (depending on the index used), which is not particularly good. We can use these indexes to compare models, as we did with AIC.
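The McFadden index in this output can be reproduced directly from the two log-likelihoods, without needing pscl: it is one minus the ratio of the model log-likelihood to the null log-likelihood.

```r
# McFadden pseudo R2 from the llh and llhNull values reported above
1 - (-12.3910108 / -17.3086742)  # ≈ 0.284, matching the pR2 output
```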

As mentioned, GLM can be used for fitting linear models not only in the two scenarios we described above, but in any occasion where data do not comply with the normality assumption. For example, we can look at another dataset available in agridat, where the variable of interest is slightly non-normal:

```r
> dat = hughes.grapes
> str(dat)
'data.frame': 270 obs. of 6 variables:
 $ block   : Factor w/ 3 levels "B1","B2","B3": 1 1 1 1 1 1 1 1 1 1 ...
 $ trt     : Factor w/ 6 levels "T1","T2","T3",..: 1 2 3 4 5 6 1 2 3 4 ...
 $ vine    : Factor w/ 3 levels "V1","V2","V3": 1 1 1 1 1 1 1 1 1 1 ...
 $ shoot   : Factor w/ 5 levels "S1","S2","S3",..: 1 1 1 1 1 1 2 2 2 2 ...
 $ diseased: int 1 2 0 0 3 0 7 0 1 0 ...
 $ total   : int 14 12 12 13 8 9 8 10 14 10 ...
```

The variable total has a skewness of 0.73, which means that with a transformation it could probably be made to fit a normal distribution. However, for the sake of the discussion we will assume it cannot be transformed. So now our problem is to identify the best distribution for our data; to do so we can use the function descdist in the package fitdistrplus we already loaded:

` descdist(dat$total, discrete = FALSE) `

This returns the following plot:

Here we can see that our data (blue dot) are close to the normal distribution and maybe closer to a gamma distribution. We can further check this using another function from the same package:

` plot(fitdist(dat$total,distr="gamma")) `

which creates the following plot:

From this we can see that in fact our data seem to be close to a gamma distribution, so now we can proceed with modelling:

` mod8 = glm(total ~ trt * vine, data=dat, family=Gamma(link=identity)) `

In the option family we included the name of the distribution, plus a link function that is used if we want to transform our data (in this case identity leaves the data untransformed).

This is what we do to model other types of data that do not fit with a normal distribution. Other possible families supported by GLM are:

binomial, gaussian, Gamma, inverse.gaussian, poisson, quasi, quasibinomial, quasipoisson

Other possible link functions (whose availability depends on the family) are:

logit, probit, cauchit, cloglog, identity, log, sqrt, 1/mu^2, inverse.

As with linear models, linear mixed-effects models need to comply with normality. If our data deviate too much from it, we need to apply the generalized form, which is available in the package lme4:

```r
install.packages("lme4")
library(lme4)
```

For this example we will use again the dataset johnson.blight:

` dat = johnson.blight `

Now we can fit a generalized linear mixed-effects model with a random effect for area, and compare it with a model with only fixed effects:

```r
mod10 = glm(blight ~ precip.m, data=dat, family="binomial")
mod11 = glmer(blight ~ precip.m + (1|area), data=dat, family="binomial")

> AIC(mod10, mod11)
      df       AIC
mod10  2 37.698821
mod11  3  9.287692
```

As you can see this new model reduces the AIC substantially.

The same function can be used for Poisson regression, but it does not support the quasi-Poisson family for overdispersed data. However, within lme4 there is the function glmer.nb for negative binomial mixed-effects models. The syntax is the same as glmer, except that in glmer.nb we do not need to include family.

Finch, W.H., Bolin, J.E. and Kelley, K., 2014. *Multilevel modeling using R*. CRC Press.

Yan, X. and Su, X., 2009. *Linear regression analysis: theory and computing*. World Scientific.

James, G., Witten, D., Hastie, T. and Tibshirani, R., 2013. *An introduction to statistical learning* (Vol. 6). New York: Springer. http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf

Long, J. Scott. 1997. *Regression Models for Categorical and Limited Dependent Variables*. Sage. pp104-106. [For pseudo R-Squared equations, page available on google books]

Webster, R. and Oliver, M.A., 2007. *Geostatistics for environmental scientists*. John Wiley & Sons.

West, B.T., Galecki, A.T. and Welch, K.B., 2014. *Linear mixed models*. CRC Press.

Gałecki, A. and Burzykowski, T., 2013. *Linear mixed-effects models using R: A step-by-step approach*. Springer Science & Business Media.

Williams, R., 2004. *One-Way Analysis of Variance*. URL: https://www3.nd.edu/~rwilliam/stats1/x52.pdf

Witte, R. and Witte, J., 2009. *Statistics*. 9th ed. Wiley.

There are several ways to check the accuracy of our models, some are printed directly in R within the summary output, others are just as easy to calculate with specific functions.

The R-squared is probably the most commonly used statistic; it allows us to understand the percentage of variance in the target variable explained by the model. It can be computed as the ratio of the regression sum of squares to the total sum of squares. This is one of the standard measures of accuracy that R prints out through the function summary for linear models and ANOVAs.

The adjusted R-squared is a form of R-squared that is adjusted for the number of terms in the model. It can be computed as follows:

Adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1)

where R² is the R-squared of the model, n is the sample size and p is the number of terms (or predictors) in the model. This index is extremely useful to detect possible overfitting in the model. This happens particularly when the sample size is small: in such cases, if we fill the model with predictors we may end up increasing the R-squared simply because the model starts adapting to the noise in the data rather than properly describing it.
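As a small sketch, the adjusted R-squared formula can be written as a helper function (names are ours) and cross-checked against what R itself reports:

```r
# Adjusted R-squared: penalizes R2 for the number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With a small sample, adding predictors erodes the adjustment
adj_r2(0.80, n = 20, p = 2)   # 0.7765
adj_r2(0.80, n = 20, p = 10)  # 0.5778

# Cross-check against R's own computation on a built-in dataset
m <- lm(dist ~ speed, data = cars)
adj_r2(summary(m)$r.squared, n = nrow(cars), p = 1)
summary(m)$adj.r.squared  # same value
```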

The previous indexes measure the amount of variance in the target variable that can be explained by our model. This is a good indication, but in some cases we are more interested in quantifying the error in the same measuring unit as the variable. In such cases we need to compute indexes that average the residuals of the model. The problem is that the residuals are both positive and negative, and their distribution should be fairly symmetrical, which means their average will always be close to zero. So we need other indexes to quantify the average residuals, for example by averaging the squared residuals:

RMSE = sqrt( (1/n) Σ (ŷ_t − y_t)² )

This is the root mean squared error: the square root of the mean of the squared residuals, with ŷ_t being the estimated value at point t, y_t being the observed value and n being the sample size. The RMSE has the same measuring unit as the variable y.

The mean squared error (MSE) is simply the quantity under the square root in the previous equation, but it is not used as often. The issue with both the RMSE and the MSE is that, since they square the residuals, they are strongly affected by large residuals. This means that even if our model explains the large majority of the variation in the data very well, with few exceptions, these exceptions will inflate the value of the RMSE.

To reduce the impact of large residuals we can use the mean absolute error (MAE), where we average the absolute values of the residuals:

MAE = (1/n) Σ |ŷ_t − y_t|

This index is more robust against large residuals. Since the RMSE is still widely used, even though its problems are well known, it is usually better to calculate and present both in a research paper.
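Both indexes are one-liners in R. Here is a sketch (variable names are ours) comparing them on residuals with one large outlier, which illustrates why the RMSE inflates while the MAE stays moderate:

```r
# RMSE and MAE as simple helper functions
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
mae  <- function(obs, pred) mean(abs(obs - pred))

obs  <- c(10, 12, 11, 13, 30)     # last observation is an outlier
pred <- c(10.5, 11.5, 11, 13, 14) # a model that misses the outlier

rmse(obs, pred)  # about 7.16, inflated by the single large residual
mae(obs, pred)   # 3.4, much less affected
```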

The AIC (Akaike information criterion) is another popular index that we have used throughout the text to compare different models. It is popular because it penalizes the model's error for the number of predictors, thus helping to account for overfitting. For a least-squares model it can be computed as follows:

AIC = n · log(RSS / n) + 2p

where again p is the number of terms in the model and RSS is the residual sum of squares.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R tutorial for Spatial Statistics**.


(This article was first published on ** S+/R – Yet Another Blog in Statistical Computing**, and kindly contributed to R-bloggers)

Below is an example showing how to fit a Generalized Linear Model with H2O in R. The output is much more comprehensive than the one generated by the generic R glm().

```r
> library(h2o)
> h2o.init(max_mem_size = "12g")
> df1 <- h2o.uploadFile("Documents/credit_count.txt", header = TRUE, sep = ",", parse_type = "CSV")
> df2 <- h2o.assign(df1[df1$CARDHLDR == 1, ], "glm_df")
> h2o.colnames(df2)
 [1] "CARDHLDR" "DEFAULT"  "AGE"      "ACADMOS"  "ADEPCNT"  "MAJORDRG"
 [7] "MINORDRG" "OWNRENT"  "INCOME"   "SELFEMPL" "INCPER"   "EXP_INC"
[13] "SPENDING" "LOGSPEND"
> Y <- "DEFAULT"
> X <- c("MAJORDRG", "MINORDRG", "INCOME", "OWNRENT")
> dist <- "binomial"
> link <- "logit"
> id <- "h2o_mdl01"
> mdl <- h2o.glm(X, Y, training_frame = h2o.getFrame("glm_df"),
                 model_id = id, family = dist, link = link,
                 lambda = 0, compute_p_values = TRUE, standardize = FALSE)
> show(h2o.getModel(id)@model$coefficients_table)
Coefficients: glm coefficients
      names coefficients std_error    z_value  p_value
1 Intercept    -1.204439  0.090811 -13.263121 0.000000
2  MAJORDRG     0.203135  0.069250   2.933370 0.003353
3  MINORDRG     0.202727  0.047971   4.226014 0.000024
4   OWNRENT    -0.201223  0.071619  -2.809636 0.004960
5    INCOME    -0.000442  0.000040 -10.942350 0.000000
> h2o.performance(h2o.getModel(id))
H2OBinomialMetrics: glm
** Reported on training data. **

MSE:  0.08414496
RMSE:  0.2900775
LogLoss:  0.3036585
Mean Per-Class Error:  0.410972
AUC:  0.6432189
Gini:  0.2864378
R^2:  0.02005004
Residual Deviance:  6376.221
AIC:  6386.221

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
          0    1    Error        Rate
0      7703 1800 0.189414  =1800/9503
1       630  366 0.632530    =630/996
Totals 8333 2166 0.231451 =2430/10499

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.126755 0.231499 142
2                       max f2  0.075073 0.376556 272
3                 max f0point5  0.138125 0.191828 115
4                 max accuracy  0.368431 0.905039   0
5                max precision  0.314224 0.250000   3
6                   max recall  0.006115 1.000000 399
7              max specificity  0.368431 0.999895   0
8             max absolute_mcc  0.126755 0.128940 142
9   max min_per_class_accuracy  0.106204 0.604546 196
10 max mean_per_class_accuracy  0.103730 0.605663 202
```
