(This article was first published on **R – Spatial.ly**, and kindly contributed to R-bloggers)

Here is a quick example of how to get started with some of the more sophisticated point pattern analysis tools that have been developed for ecologists – principally the `adehabitatHR` package – but that are very useful for human data. Ecologists deploy point pattern analysis to establish the “home range” of a particular animal based on the known locations where it has been sighted (either directly or remotely via camera traps). Essentially it is where the animal spends most of its time. In the case of human datasets the analogy can be extended to identify areas where most crimes are committed – hotspots – or to identify the activity spaces of individuals or the catchment areas of services such as schools and hospitals.

This tutorial offers a rough analysis of crime data in London so the maps should not be taken as definitive – I’ve just used them as a starting point here.

Police crime data have been taken from https://data.police.uk/data/. This tutorial uses data from London’s Metropolitan Police (September 2017).

For the files used below click here.

```
#Load in the library and the csv file.
library("rgdal")
library(raster)
library(adehabitatHR)
input<- read.csv("2017-09-metropolitan-street.csv")
input<- input[,1:10] #We only need the first 10 columns
input<- input[complete.cases(input),] #This line of code removes rows with NA values in the data.
```

At the moment `input` is a basic data frame. We need to convert it into a spatial object. Note that we first specify the EPSG code 4326 since the coordinates are in WGS84. We then use `spTransform` to reproject the data into British National Grid, so the coordinate values are in metres.

```
Crime.Spatial<- SpatialPointsDataFrame(input[,5:6], input, proj4string = CRS("+init=epsg:4326"))
Crime.Spatial<- spTransform(Crime.Spatial, CRS("+init=epsg:27700")) #Reproject from WGS84 to British National Grid
plot(Crime.Spatial) #Plot the data
```

The plot reveals that we have crimes across the UK, not just in London. So we need an outline of London to help limit the view. Here we load in a shapefile of the Greater London Authority Boundary.

`London<- readOGR(".", layer="GLA_outline")`

Thinking ahead, we may wish to compare a number of density estimates, so they need to be performed across a consistently sized grid. Here we create an empty grid in advance to feed into the `kernelUD` function.

```
Extent<- extent(London) #this is the geographic extent of the grid. It is based on the London object.
#Here we specify the size of each grid cell in metres (since those are the units our data are projected in).
resolution<- 500
#This is some magic that creates the empty grid
x <- seq(Extent[1],Extent[2],by=resolution) # where resolution is the pixel size you desire
y <- seq(Extent[3],Extent[4],by=resolution)
xy <- expand.grid(x=x,y=y)
coordinates(xy) <- ~x+y
gridded(xy) <- TRUE
#You can see the grid here (this may appear solid black if the cells are small)
plot(xy)
plot(London, border="red", add=T)
```

OK, now run the density estimation. Note we use `grid = xy` to utilise the grid we just created. This is for all crime in London.

```
all <- raster(kernelUD(Crime.Spatial, h="href", grid = xy)) #Note we are running two functions here - first kernelUD, then converting the result to a raster object.
#First results
plot(all)
plot(London, border="red", add=T)
```

Unsurprisingly we have a hotspot over the centre of London. Are there differences for specific crime types? We may, for example, wish to look at the density of burglaries.

```
plot(Crime.Spatial[Crime.Spatial$Crime.type=="Burglary",]) # quick plot of burglary points
burglary<- raster(kernelUD(Crime.Spatial[Crime.Spatial$Crime.type=="Burglary",], h="href", grid = xy))
plot(burglary)
plot(London, border="red", add=T)
```

There’s a slight difference but still it’s tricky to see if there are areas where burglaries concentrate more compared to the distribution of all crimes. A very rough way to do this is to divide one density grid by the other.

```
both<-burglary/all
plot(both)
plot(London, border="red", add=T)
```

This hasn’t worked particularly well since there are edge effects on the density grid, caused by a few stray points at the edge of the grid. We can solve this by capping the values we map – here we only show values between 0 and 1. Some more interesting structures emerge, with burglary occurring in more residential areas, as expected.

```
both2 <- both
both2[both <= 0] <- NA
both2[both >= 1] <- NA
#Now we can see the hotspots much more clearly.
plot(both2)
plot(London, add=T)
```

There are many more sophisticated approaches to this kind of analysis – I’d encourage you to look at the `adehabitatHR` vignette on the package page.

To **leave a comment** for the author, please follow the link and comment on their blog: **R – Spatial.ly**.


(This article was first published on **INWT-Blog-RBloggers**, and kindly contributed to R-bloggers)

This is a reproduction of the (simple) bar plot from chapter 6.1.1 of Datendesign mit R, using ggplot2. To download the data you can use the following lines:

```
dir.create("data")
writeLines("*", "data/.gitignore")
download.file("http://www.datendesign-r.de/alle_daten.zip", "data/alle_daten.zip")
unzip("data/alle_daten.zip", exdir = "data")
```

And to download the original script and the base R version of this plot:

```
download.file("http://www.datendesign-r.de/beispielcode.zip", "data/beispielcode.zip")
unzip("data/beispielcode.zip", exdir = "data")
```

After downloading check out the original pdf version of this plot in *data/beispielcode/pdf/balkendiagramm_einfach.pdf*.

Here are some steps to modify the data such that it can be easily used with ggplot2.

```
# remember to adjust the path
ipsos <- openxlsx::read.xlsx("../data/alle_daten/ipsos.xlsx")
ipsos <- ipsos[order(ipsos$Wert), ]
ipsos$Land <- ordered(ipsos$Land, ipsos$Land)
ipsos$textFamily <- ifelse(ipsos$Land %in% c("Deutschland", "Brasilien"),
                           "Lato Black", "Lato Light")
ipsos$labels <- paste0(ipsos$Land, ifelse(ipsos$Wert < 10, " ", " "), ipsos$Wert)
rect <- data.frame(
  ymin = seq(0, 80, 20),
  ymax = seq(20, 100, 20),
  xmin = 0.5, xmax = 16.5,
  colour = rep(c(grDevices::rgb(191, 239, 255, 80, maxColorValue = 255),
                 grDevices::rgb(191, 239, 255, 120, maxColorValue = 255)),
               length.out = 5))
```

First we add the geoms, then modifications to the scales and flip of the coordinate system. The remaining code is just modifying the appearance.

```
library("ggplot2")
ggBar <- ggplot(ipsos) +
  geom_bar(aes(x = Land, y = Wert), stat = "identity", fill = "grey") +
  geom_bar(aes(x = Land,
               y = ifelse(Land %in% c("Brasilien", "Deutschland"), Wert, NA)),
           stat = "identity", fill = rgb(255, 0, 210, maxColorValue = 255)) +
  geom_rect(data = rect,
            mapping = aes(ymin = ymin, ymax = ymax, xmin = xmin, xmax = xmax),
            fill = rect$colour) +
  geom_hline(aes(yintercept = 45), colour = "skyblue3") +
  scale_y_continuous(breaks = seq(0, 100, 20), limits = c(0, 100),
                     expand = c(0, 0)) +
  scale_x_discrete(labels = ipsos$labels) +
  coord_flip() +
  labs(y = NULL, x = NULL, title = NULL) +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        axis.ticks = element_blank(),
        axis.text.y = element_text(family = ipsos$textFamily),
        text = element_text(family = "Lato Light"))
ggBar
```

Of course you can simply add the title and text annotations to the plot using ggplot2, but I didn’t find a way to do the exact placement comparable to the original version without the package grid.

```
library("grid")
vp_make <- function(x, y, w, h)
  viewport(x = x, y = y, width = w, height = h, just = c("left", "bottom"))
main        <- vp_make(0.05, 0.05, 0.9, 0.8)
title       <- vp_make(0, 0.9, 0.6, 0.1)
subtitle    <- vp_make(0, 0.85, 0.4, 0.05)
footnote    <- vp_make(0.55, 0, 0.4, 0.05)
annotation1 <- vp_make(0.7, 0.85, 0.225, 0.05)
annotation2 <- vp_make(0.4, 0.85, 0.13, 0.05)
# To see which space these viewports will use:
grid.rect(gp = gpar(lty = "dashed"))
grid.rect(gp = gpar(col = "grey"), vp = main)
grid.rect(gp = gpar(col = "grey"), vp = title)
grid.rect(gp = gpar(col = "grey"), vp = subtitle)
grid.rect(gp = gpar(col = "grey"), vp = footnote)
grid.rect(gp = gpar(col = "grey"), vp = annotation1)
grid.rect(gp = gpar(col = "grey"), vp = annotation2)
```

And now we can add the final annotations to the plot:

```
# pdf_datei <- "balkendiagramme_einfach.pdf"
# cairo_pdf(bg = "grey98", pdf_datei, width = 9, height = 6.5)
grid.newpage()
print(ggBar, vp = main)
```

```
grid.text("'Ich glaube fest an Gott oder ein höheres Wesen'",
          gp = gpar(fontfamily = "Lato Black", fontsize = 14),
          just = "left", x = 0.05, vp = title)
grid.text("...sagten 2010 in:",
          gp = gpar(fontfamily = "Lato Light", fontsize = 12),
          just = "left", x = 0.05, vp = subtitle)
grid.text("Quelle: www.ipsos-na.com, Design: Stefan Fichtel, ixtract",
          gp = gpar(fontfamily = "Lato Light", fontsize = 9),
          just = "right", x = 0.95, vp = footnote)
grid.text("Alle Angaben in Prozent",
          gp = gpar(fontfamily = "Lato Light", fontsize = 9),
          just = "right", x = 1, y = 0.55, vp = annotation1)
grid.text("Durchschnitt: 45",
          gp = gpar(fontfamily = "Lato Light", fontsize = 9),
          just = "right", x = 0.95, y = 0.55, vp = annotation2)
```

`# dev.off()`


(This article was first published on **fishR Blog**, and kindly contributed to R-bloggers)

A **fishR** user recently asked me:

In the book that you published, I frequently use the stock-recruit curve code. The interface that shows both the Ricker/Beverton-Holt figure with the recruit per spawner to spawner figure (i.e., the dynamic plot for `srStarts()`) has not been working for quite some time. Additionally, I can get the recruits versus spawners plot for the Beverton-Holt or Ricker curve with confidence bounds around the curve, but how do you do the same for the recruit per spawner to spawner curve?

Below I answer both questions. First, however, I load required packages …

… and obtain the same data (in PSalmonAK.csv, available from here) used in the Stock-Recruitment chapter of my **Introductory Fisheries Analyses with R** (IFAR) book. Note that I created two new variables here: `retperesc` is the “recruits per spawner” and `logretperesc` is the natural log of the recruits per spawner variable.

The first question about the dynamic plot is due to a change in functionality in the **FSA** package since **IFAR** was published. The `dynamicPlot=` argument was removed because the code for that argument relied on the **tcltk** package, which I found difficult to reliably support. A similar, though more manual, approach is accomplished with the new `fixed=` and `plot=` arguments. For example, using `plot=TRUE` (without `fixed=`) will produce a plot of “recruits” versus “stock” with the chosen stock-recruitment model evaluated at the automatically chosen parameter starting values superimposed.

The user, however, can show the stock-recruitment model evaluated at manually chosen parameter starting values by including those starting values in a list that is supplied to `fixed=`. These values can be iteratively changed in subsequent calls to `srStarts()` to manually find starting values that provide a model that reasonably fits (by eye) the stock-recruit data.
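That iterate-by-eye workflow can be sketched in base R (hypothetical data and candidate values; this is not FSA's internal code):

```r
# Hypothetical stock-recruit data: spawners (S) and recruits (R)
S <- c(100, 250, 400, 600, 900)
R <- c(300, 520, 640, 650, 500)
plot(R ~ S, ylim = c(0, 800))
# Candidate Ricker starting values, of the kind one would supply via fixed=
a <- 3; b <- 0.0015
# Overlay the candidate curve; re-run with tweaked a, b until it fits by eye
curve(a * x * exp(-b * x), add = TRUE)
```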

Note, however, that `srStarts()` no longer supports simultaneously plotting spawners versus recruits and recruits per spawner versus spawners.

The first way that I can imagine plotting recruits per spawners versus spawners with the fitted curve and confidence bands is to first follow the code for fitting the stock-recruit function to the stock and recruit data as described in **IFAR**. In this case, the stock-recruit function is fit on the log scale to adjust for a multiplicative error structure (as described in **IFAR**).

As described in the book, the plot of spawners versus recruits is made by (i) constructing a sequence of “x” values that span the range of observed numbers of spawners, (ii) predicting the number of recruits at each spawner value using the best-fit stock-recruitment model, (iii) constructing lower and upper confidence bounds for the predicted number of recruits at each spawner value with the bootstrap results, (iv) making a schematic plot on which to put (v) a polygon for the confidence band, (vi) the raw data points, and (vii) the best-fit curve. The code below follows these steps and reproduces Figure 12.4 in the book.

These results can be modified to plot recruits per spawner versus spawners by replacing the “recruits” in the code above with “recruits per spawner.” This is simple for the actual data, as `ret` is simply replaced with `retperesc`. However, the predicted number of recruits (in `pR`) and the confidence bounds (in `LCI` and `UCI`) from above must be divided by the number of spawners (in `x`). As the `/` symbol has a special meaning in R formulas, this division must be contained within `I()` when it appears in a formula (see the `lines()` code below). Of course, the y-axis scale range must also be adjusted. Thus, a plot of recruits per spawner versus spawners is produced from the previous results with the following code.
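As a minimal toy illustration of the `I()` syntax only (made-up numbers; not the book's plotting code):

```r
# '/' inside a model formula normally denotes nesting, so wrap the division
# in I() to force ordinary arithmetic division (toy data below).
d <- data.frame(S = c(100, 200, 400, 800),   # spawners
                R = c(250, 400, 560, 640))   # recruits
plot(I(R/S) ~ S, data = d, ylab = "Recruits per spawner")
lines(I(R/S) ~ S, data = d)
```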

Alternatively, the Ricker stock-recruitment model could be reparameterized by dividing each side of the function by “spawners” such that the right-hand side becomes “recruits per spawner” (this is a fairly typical reparameterization of the model). This model can be put into an R function, with model parameters then estimated with nonlinear regression similar to above. The results below show that the parameter point estimates are identical and the bootstrapped confidence intervals are similar to what was obtained above.
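The algebra behind that reparameterization can be sketched directly (my own illustration, not the book's code): dividing the Ricker function R = a·S·e^(−bS) by S leaves recruits per spawner on the left-hand side.

```r
# Ricker: R = a * S * exp(-b * S); divide both sides by S to get
# recruits per spawner: R/S = a * exp(-b * S)
rckrRPS <- function(S, a, b) a * exp(-b * S)

# The two forms agree by construction:
a <- 3; b <- 0.0015; S <- c(100, 400, 800)
all.equal(a * S * exp(-b * S) / S, rckrRPS(S, a, b))
```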

With this, a second method for plotting recruits per spawner versus spawners is the same as how the main plot from the book was constructed but modified to use the results from this “new” model.

The two methods described above for plotting recruits per spawner versus spawners are identical for the best-fit curve and nearly identical for the confidence bounds (slight differences are likely due to the randomness inherent in bootstrapping). Thus, the two methods produce nearly the same visual.


(This article was first published on **R on The Jumping Rivers Blog**, and kindly contributed to R-bloggers)

It’s that time of year when we need to start thinking about what R Conferences we would like to (and can!) attend. To help plan your (ahem) work trips, we thought it would be useful to list the upcoming main attractions.

We maintain a list of upcoming rstats conferences. To keep up to date, just follow our twitter bot.

rstudio::conf is about all things R and RStudio

The next RStudio conference is held in San Diego and boasts top-quality speakers from around the R world. With excellent tutorials and great talks, this is certainly one of the top events. The estimated cost is around $200 per day, so it is not the cheapest option, but it is worthwhile.

- January 31, Feb 1-3: rstudio::conf. San Diego, USA.

An opportunity to hear from and network with top Researchers, Data Scientists and Developers from the R community in South Africa and beyond.

This workshop combines an amazing location with great speakers at an amazing price (only $70 per day). The keynote speakers are Maëlle Salmon and Stephanie Kovalchik. This year's SatRday has two tutorials, one on package building and the other on sports modelling.

- March 17th: SatRday. Cape Town, South Africa.

The European R Users Meeting, eRum, is an international conference that aims at integrating users of the R language living in Europe. This conference is similar to useR!.

- May: The European #rstats Users Meeting. Budapest, Hungary. @erum2018

Since the inaugural conference in 2009, the annual R/Finance conference in Chicago has become the primary meeting for academics and practitioners interested in using R in finance. Participants from academia and industry mingle for two days to exchange ideas about current research, best practices and applications. A single-track program permits continued focus on a series of refereed submissions. We hear a lively social program rounds out the event.

- June 1-2: R/Finance 2018. Chicago, USA.

With useR! 2017 spanning 5 days and boasting some of the biggest names in data science, the next installment of the useR! series is sure to be even better. Last year the program was packed with talks and tutorials from industry and academia. Each day usually contains numerous tutorials and ends in a keynote speech followed by dinner(!). Registration opens in January 2018.

- July 10-13: useR! 2018. Brisbane, Australia.

Born on Twitter, the Noreast’R Conference is a grassroots effort to organize a regional #rstats conference in the Northeastern United States. Work is currently under way and further details will follow on the Noreast’R website and Twitter page (linked below).


(This article was first published on **R – Decision Science News**, and kindly contributed to R-bloggers)

BAGEL SHOP IDEA

I was sitting in a bagel shop on Saturday with my 9 year old daughter. We had brought along hexagonal graph paper and a six sided die. We decided that we would choose a hexagon in the middle of the page and then roll the die to determine a direction:

1 up (North)

2 diagonal to the upper right (Northeast)

3 diagonal to the lower right (Southeast)

4 down (South)

5 diagonal to the lower left (Southwest)

6 diagonal to the upper left (Northwest)
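For anyone wanting to replay the game in R, here is one way to encode the table above as unit vectors (an assumption on my part: flat-topped hexagons, so the six neighbours sit at 90°, 30°, −30°, −90°, −150° and 150°):

```r
# Die faces 1..6 mapped to N, NE, SE, S, SW, NW unit displacements
ang  <- pi / 180 * c(90, 30, -30, -90, -150, 150)
dirs <- cbind(x = cos(ang), y = sin(ang))
# A 50-step walk: roll the die, then cumulate the unit displacements
rolls <- sample(6, 50, replace = TRUE)
path  <- rbind(c(0, 0), apply(dirs[rolls, ], 2, cumsum))
plot(path, type = "l", asp = 1)
```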

Our first roll was a six so we drew a line to the hexagon northwest of where we started. That was the first “step.”

After a few rolls we found ourselves coming back along a path we had gone down before. We decided to draw a second line close to the first in those cases.

We did this about 50 times. The results are pictured above, along with kid hands for scale.

I sent the picture to my friend and serial co-author Jake Hofman because he likes a good kid’s science project and has a mental association for everything in applied math. He wrote “time for some Brownian motion?” and sent a talk he’d given a decade ago at a high school, which taught me all kinds of stuff I didn’t know connecting random walks to Brownian motion, Einstein, the existence of atoms and binomial pricing trees in finance. (I was especially embarrassed not to have mentally connected random walks to binomial pricing because I had a slide on that in my job talk years ago and because it is the method we used in an early distribution builder paper.)

Back at home Jake did some simulations on random walks in one dimension (in which you just go forward or backward with equal probability) and sent them to me. Next, I did the same with hexagonal random walks (code at the end of this post). Here’s an image of one random walk on a hexagonal field.

I simulated 5000 random walks of 250 steps, starting at the point 0,0. The average X and Y position is 0 at each step, as shown here.

This might seem strange at first. But think about many walks of just one step. The number of one-step journeys in which your X position is increased a certain amount will be matched, in expectation, by an equal number of one-step journeys in which your X position is decreased by the same amount. Your average X position is thus 0 at the first step. Same is true for Y. The logic scales when you take two or more steps and that’s why we see the flat lines we do.

If you think about this wrongheadedly you’d think you weren’t getting anywhere. But of course you are. Let’s look at your average distance from the starting point at each step (below).

The longer you walk, the more distant from the starting point you tend to be. Because distances are positive, the average of those distances is positive. We say you “tend to” move away from the origin at each step, because that is what happens on average over many trips. At any given step on any given trip, you could move towards or away from the starting point with equal probability. This is deep stuff.

Speaking of deep stuff, you might notice that the relationship is pretty. Let’s zoom in.

The dashed line is the square root of the number of steps. It’s interesting to note that this square root relationship happens in a one-dimensional random walk as well. There’s a good explanation of it in this document. As Jake put it, it’s as if the average walk is covered by a circular plate whose area grows linearly with the number of steps. (Why linearly? Because the area of a circle is proportional to its radius squared. Since the radius grows as the square root of the number of steps, the radius squared is linear in the number of steps.)

(*) As a sidenote, I was at first seeing something that grew more slowly than the square root and couldn’t figure out what the relationship was. It turns out that the square root relationship holds for the root mean squared distance (the square root of the mean of the squared distances), and I had been looking at the mean Euclidean distance. It’s a useful reminder that the term “average” has quite a few definitions. “Average” is a useful term for getting the gist across, but it can lead to some confusion.
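To make the distinction concrete, here is a small self-contained simulation of hexagonal walks (my own sketch, not Jake's code) that computes both averages:

```r
set.seed(1)
# Six hex directions as unit vectors (assuming flat-topped hexagons)
ang  <- pi / 180 * c(90, 30, -30, -90, -150, 150)
dirs <- cbind(cos(ang), sin(ang))
n_steps <- 250; n_walks <- 5000
# Squared distance from the origin after n_steps, for each walk
d2 <- replicate(n_walks, {
  rolls <- sample(6, n_steps, replace = TRUE)
  sum(colSums(dirs[rolls, ])^2)
})
sqrt(mean(d2)) / sqrt(n_steps)  # root mean squared distance: ratio close to 1
mean(sqrt(d2)) / sqrt(n_steps)  # mean Euclidean distance: noticeably smaller
```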

Speaking of gists, here’s the R code. Thanks to @hadleywickham for creating the tidyverse and making everything awesome.

RCODE

The post Random walking appeared first on Decision Science News.


(This article was first published on **Revolutions**, and kindly contributed to R-bloggers)

A quick personal note: today is my first day as a member of the Cloud Developer Advocates team at Microsoft! I'll still be blogging about and attending events related to R, and supporting the R community, but now I'll be doing it as a member of a team dedicated to community outreach.

As a bit of background, when I joined Microsoft back in 2015 via the acquisition of Revolution Analytics, I was thrilled to be able to continue my role supporting the R community. Since then, Microsoft as a whole has continued to ramp up its support of open-source projects and to interact directly with developers of all stripes (including data scientists!) through various initiatives across the company. (Aside: I knew Microsoft was a big company before I joined, but even then it took me a while to appreciate the scale of the different divisions, groups, and geographies. For me, it was a bit like moving to a new city, in the sense that it takes a while to learn what the neighborhoods are and how to find your way around.)

I learned of the Cloud Developer Advocates group after reading a RedMonk blog post about how some prominent techies (many of whom I've long admired and respected on Twitter) had been recruited into Microsoft to engage directly with developers. So when I learned that there was an entire group at Microsoft devoted to community outreach, and with such a fantastic roster already on board (and still growing!), I knew I had to be a part of it. I'll be working with a dedicated team (including Paige, Seth and Vadim) focused on data science, machine learning, and AI. As I mentioned above, I'll still be covering R, but will also be branching out into some other areas as well.

Stay tuned for more, and please hit me up on Twitter or via email if you'd like to chat, want to connect at an event, or just let me know how I can help. Let's keep the conversation going!


(This article was first published on **R – insightR**, and kindly contributed to R-bloggers)

**By Gabriel Vasconcelos**

So I decided to have a quick look at the tuber package, which extracts YouTube data in R. My cousin is a singer (a hell of a good one) and he has a YouTube channel (Dan Vasc), which I strongly recommend, where he posts his covers. I will focus my analysis on his channel. The tuber package is very friendly: it downloads YouTube statistics on comments, views, likes and more straight into R using the YouTube API.

First let us look at some general information on the channel (code for replication at the end of the text). The table below shows the number of subscribers, views, videos, etc. at the moment I downloaded the data (2017-12-12, 11:20 pm). If you run the code on your computer the results may differ because the channel will have had more activity. Dan’s channel is getting close to 1 million views and he has 58 times more likes than dislikes. His views ratio is around 13,000 views per video.

Channel | Subscriptions | Views | Videos | Likes | Dislikes | Comments
---|---|---|---|---|---|---
Dan Vasc | 5127 | 743322 | 57 | 9008 | 155 | 1993

We can also see some of the same statistics for each video. I selected only videos published after January 2016, which is when the channel became more active. The list has 29 videos. You can see that the channel became even more active in 2017. In the last month it moved to weekly publications.

date | title | viewCount | likeCount | dislikeCount | commentCount
---|---|---|---|---|---
2016-03-09 | “Heart Of Steel” – MANOWAR cover | 95288 | 1968 | 53 | 371
2016-05-09 | “The Sound Of Silence” – SIMON & GARFUNKEL / DISTURBED cover | 13959 | 556 | 6 | 85
2016-07-04 | One Man Choir – Handel’s Hallelujah | 9390 | 375 | 6 | 70
2016-08-16 | “Carry On” – MANOWAR cover | 19146 | 598 | 12 | 98
2016-09-12 | “You Are Loved (Don’t Give Up)” – JOSH GROBAN cover | 2524 | 142 | 0 | 21
2016-09-26 | “Hearts On Fire” – HAMMERFALL cover | 6584 | 310 | 4 | 58
2016-10-26 | “Dawn Of Victory” – RHAPSODY OF FIRE cover | 10335 | 354 | 5 | 69
2017-04-28 | “I Don’t Wanna Miss A Thing” – AEROSMITH cover | 9560 | 396 | 5 | 89
2017-05-09 | State of affairs | 906 | 99 | 1 | 40
2017-05-26 | “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (Japanese) | 2862 | 160 | 4 | 39
2017-05-26 | “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (Português) | 3026 | 235 | 3 | 62
2017-05-26 | “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (English) | 2682 | 108 | 2 | 14
2017-06-14 | HOW TO BE A YOUTUBE SINGER \| ASKDANVASC 01 | 559 | 44 | 1 | 19
2017-06-17 | Promotional Live \|\| Q&A and video games | 206 | 16 | 0 | 2
2017-07-18 | “The Bard’s Song” – BLIND GUARDIAN cover (SPYGLASS INN project) | 3368 | 303 | 2 | 47
2017-07-23 | “Numb” – LINKIN PARK cover (R.I.P. CHESTER) | 6717 | 350 | 14 | 51
2017-07-27 | THE PERFECT TAKE and HOW TO MIX VOCALS \| ASKDANVASC 02 | 305 | 29 | 0 | 11
2017-08-04 | 4000 Subscribers and Second Channel | 515 | 69 | 1 | 23
2017-08-10 | “Hello” – ADELE cover [ROCK VERSION] | 6518 | 365 | 2 | 120
2017-08-27 | “The Rains Of Castamere” (The Lannister Song) – GAME OF THRONES cover | 2174 | 133 | 5 | 28
2017-08-31 | “Africa” – TOTO cover | 18251 | 642 | 10 | 172
2017-09-24 | “Chop Suey!” – SYSTEM OF A DOWN cover | 2562 | 236 | 6 | 56
2017-10-09 | “An American Trilogy” – ELVIS PRESLEY cover | 1348 | 168 | 1 | 48
2017-11-08 | “Beauty And The Beast” – Main Theme cover \| Feat. Alina Lesnik | 2270 | 192 | 2 | 59
2017-11-16 | “Bohemian Rhapsody” – QUEEN cover | 2589 | 339 | 3 | 95
2017-11-23 | “The Phantom Of The Opera” – NIGHTWISH/ANDREW LLOYD WEBBER cover \| Feat. Dragica | 1857 | 209 | 2 | 42
2017-11-24 | “Back In Black” – AC/DC cover (RIP MALCOLM YOUNG) | 2202 | 207 | 2 | 56
2017-11-30 | “Immigrant Song” – LED ZEPPELIN cover | 3002 | 204 | 1 | 62
2017-12-07 | “Sweet Child O’ Mine” – GUNS N’ ROSES cover | 1317 | 201 | 2 | 86

Now that we have seen the data, let’s explore it to check for structure. The plots below show how likes, dislikes and comments relate to views. The positive relation is obvious. However, there is some degree of nonlinearity in likes and comments: the increment in likes and comments becomes smaller as views increase. The dislikes look more linear in the views, but the number of dislikes is too small to be sure.

Another interesting piece of information is how comments are distributed over time for each video. I selected the four most recent videos and plotted the comment time series below. All videos have a lot of activity in the first days, but it decreases fast a few days later. Followers and subscribers probably see the videos first, and they must be responsible for the intense activity at the beginning of each plot.

The most important information might be how the channel grows over time. Dan’s channel had two important moments in 2017: it became much more active in April and it started weekly publications in November. We can clearly see that both strategies worked in the plot below. I put in two dashed lines to mark these two events. In April the number of comments increased a lot, and it increased even more in November.

Finally, let’s have a look at what is in the comments using a word cloud (wordcloud package). I removed words that are not informative, such as “you, was, is, were”, for English and Portuguese. The result is just below.

Before using the tuber package you need an ID and a password from the Google Developer Console. Click here for more information. If you are interested, the tubern package has some other tools to work with YouTube data, such as generating reports.

```
library(tuber)
library(tidyverse)
library(lubridate)
library(stringi)
library(wordcloud)
library(gridExtra)

httr::set_config(config(ssl_verifypeer = 0L)) # = Fixes some certificate problems on linux = #

# = Authentication = #
yt_oauth("ID", "PASS", token = "")

# = Download and prepare data = #

# = Channel stats = #
chstat = get_channel_stats("UCbZRdTukTCjFan4onn04sDA")

# = Videos = #
videos = yt_search(term = "", type = "video", channel_id = "UCbZRdTukTCjFan4onn04sDA")
videos = videos %>%
  mutate(date = as.Date(publishedAt)) %>%
  filter(date > "2016-01-01") %>%
  arrange(date)

# = Comments = #
comments = lapply(as.character(videos$video_id), function(x){
  get_comment_threads(c(video_id = x), max_results = 1000)
})

# = Prep the data = #

# = Video stat table = #
videostats = lapply(as.character(videos$video_id), function(x){
  get_stats(video_id = x)
})
videostats = do.call(rbind.data.frame, videostats)
videostats$title = videos$title
videostats$date = videos$date
videostats = select(videostats, date, title, viewCount, likeCount,
                    dislikeCount, commentCount) %>%
  as.tibble() %>%
  mutate(viewCount = as.numeric(as.character(viewCount)),
         likeCount = as.numeric(as.character(likeCount)),
         dislikeCount = as.numeric(as.character(dislikeCount)),
         commentCount = as.numeric(as.character(commentCount)))

# = General stat table = #
genstat = data.frame(Channel = "Dan Vasc",
                     Subscriptions = chstat$statistics$subscriberCount,
                     Views = chstat$statistics$viewCount,
                     Videos = chstat$statistics$videoCount,
                     Likes = sum(videostats$likeCount),
                     Dislikes = sum(videostats$dislikeCount),
                     Comments = sum(videostats$commentCount))

# = videostats plots = #
p1 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = likeCount))
p2 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = dislikeCount))
p3 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = commentCount))
grid.arrange(p1, p2, p3, ncol = 2)

# = Comments time series = #
comments_ts = lapply(comments, function(x){
  as.Date(x$publishedAt)
})
comments_ts = tibble(date = as.Date(Reduce(c, comments_ts))) %>%
  group_by(date) %>%
  count()
ggplot(data = comments_ts) +
  geom_line(aes(x = date, y = n)) +
  geom_smooth(aes(x = date, y = n), se = FALSE) +
  ggtitle("Comments by day") +
  geom_vline(xintercept = as.numeric(as.Date("2017-11-08")), linetype = 2, color = "red") +
  geom_vline(xintercept = as.numeric(as.Date("2017-04-28")), linetype = 2, color = "red")

# = Comments by video = #
selected = (nrow(videostats) - 3):nrow(videostats)
top4 = videostats$title[selected]
top4comments = comments[selected]
p = list()
for(i in 1:4){
  df = top4comments[[i]]
  df$date = as.Date(df$publishedAt)
  df = df %>%
    arrange(date) %>%
    group_by(year(date), month(date), day(date)) %>%
    count()
  df$date = make_date(df$`year(date)`, df$`month(date)`, df$`day(date)`)
  p[[i]] = ggplot(data = df) + geom_line(aes(x = date, y = n)) + ggtitle(top4[i])
}
do.call(grid.arrange, p)

## = Word clouds = ##
comments_text = lapply(comments, function(x){
  as.character(x$textOriginal)
})
comments_text = tibble(text = Reduce(c, comments_text)) %>%
  mutate(text = stri_trans_general(tolower(text), "Latin-ASCII"))
remove = c("you","the","que","and","your","muito","this","that","are","for","cara",
           "from","very","like","have","voce","man","one","nao","com","with","mais",
           "was","can","uma","but","ficou","meu","really","seu","would","sua","more",
           "it's","it","is","all","i'm","mas","como","just","make","what","esse","how",
           "por","favor","sempre","time","esta","every","para","i've","tem","will",
           "you're","essa","not","faz","pelo","than","about","acho","isso",
           "way","also","aqui","been","out","say","should","when","did","mesmo",
           "minha","next","cha","pra","sei","sure","too","das","fazer","made",
           "quando","ver","cada","here","need","ter","don't","este","has","tambem",
           "una","want","ate","can't","could","dia","fiquei","num","seus","tinha","vez",
           "ainda","any","dos","even","get","must","other","sem","vai","agora","desde",
           "dessa","fez","many","most","tao","then","tudo","vou","ficaria","foi","pela",
           "see","teu","those","were")
words = tibble(word = Reduce(c, stri_extract_all_words(comments_text$text))) %>%
  group_by(word) %>%
  count() %>%
  arrange(desc(n)) %>%
  filter(nchar(word) >= 3) %>%
  filter(n > 10 & word %in% remove == FALSE)
set.seed(3)
wordcloud(words$word, words$n, random.order = FALSE, random.color = TRUE,
          rot.per = 0.3, colors = 1:nrow(words))
```

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – insightR**.


(This article was first published on ** poissonisfish**, and kindly contributed to R-bloggers)

Statistical models generally assume that

- All observations are independent from each other
- The distribution of the residuals follows a normal distribution with mean zero and constant variance, irrespective of the values taken by the dependent variable *y*

When either of the two assumptions does not hold, more sophisticated modelling approaches are necessary. Let’s consider two hypothetical problems that violate the two respective assumptions, where *y* denotes the dependent variable:

**A.** Suppose you want to study the relationship between average income (*y*) and the educational level in the population of a town comprising four fully segregated blocks. You will sample 1,000 individuals irrespective of their blocks. If you model as such, you neglect dependencies among observations – individuals from the same block are not independent, yielding residuals that correlate within block.

**B.** Suppose you want to study the relationship between anxiety (*y*) and the levels of triglycerides and uric acid in blood samples from 1,000 people, measured 10 times in the course of 24 hours. If you model as such, you will likely find that the variance of *y* changes over time – this is an example of heteroscedasticity, a phenomenon characterized by the heterogeneity in the variance of the residuals.

In **A.** we have a problem of dependency caused by spatial correlation, whereas in **B.** we have a problem of heterogeneous variance. As a result, classic linear models cannot help in these hypothetical problems, but both can be addressed using linear mixed-effect models (LMMs). In rigour though, you do not need LMMs to address the second problem.
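Problem **A.** can be sketched in a few lines of base R. The numbers below are purely illustrative (four blocks with hypothetical baseline incomes): a naive model ignoring the blocks leaves block-level structure in the residuals, whereas adding the block factor absorbs it.

```r
# Simulate four fully segregated blocks with different baseline incomes
set.seed(1)
n_per_block <- 250
block <- rep(1:4, each = n_per_block)
block_effect <- c(-2, -1, 1, 2)[block]      # hypothetical block-level shifts
education <- rnorm(4 * n_per_block)
income <- 3 + 0.5 * education + block_effect + rnorm(4 * n_per_block, sd = 0.5)

# Naive model ignoring blocks: residuals correlate within block
fit <- lm(income ~ education)
res_by_block <- tapply(resid(fit), block, mean)
print(round(res_by_block, 2))   # block means of residuals far from zero

# Accounting for block structure absorbs the dependency
fit2 <- lm(income ~ education + factor(block))
print(round(tapply(resid(fit2), block, mean), 10))   # ~0 in every block
```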

LMMs are extraordinarily powerful, yet their complexity makes them hard to appreciate for a broader community. LMMs dissect hierarchical and / or longitudinal (*i.e.* time course) data by separating the variance due to random sampling from the main effects. In essence, on top of the fixed effects normally used in classic linear models, LMMs resolve *i*) correlated residuals by introducing random effects that account for differences among random samples, and *ii*) heterogeneous variance using specific variance functions, thereby improving the estimation accuracy and interpretation of fixed effects in one go.

I personally reckon that most relevant textbooks and papers are hard to grasp for non-mathematicians. Therefore, following the brief reference in my last post on GWAS I will dedicate the present tutorial to LMMs. For further reading I highly recommend the ecology-oriented Zuur *et al*. (2009) and the R-intensive Gałecki *et al*. (2013) books, and this simple tutorial from Bodo Winter. For agronomic applications, H.-P. Piepho *et al*. (2003) is an excellent theoretical introduction.

Here, we will build LMMs using the Arabidopsis dataset from the `lme4` package, from a study published by Banta *et al*. (2010). These data summarize variation in total fruit set per plant in *Arabidopsis thaliana* plants conditioned to fertilization and simulated herbivory. Our goal is to understand the effect of fertilization and simulated herbivory adjusted for experimental differences across groups of plants.

Whereas the classic linear model with *n* observational units and *p* predictors has the vectorized form

*y* = *Xβ* + *ε*

with the *n* × (*p* + 1) predictor matrix *X*, the vector *β* of *p* + 1 coefficient estimates and the *n*-long vectors of the response *y* and the residuals *ε*, LMMs additionally accommodate separate variance components modelled with a set of random effects *u*,

*y* = *Xβ* + *Zu* + *ε*

where *X* and *Z* are design matrices that jointly represent the set of predictors. Random effects models include only an intercept as the fixed effect and a defined set of random effects. Random effects comprise random intercepts and / or random slopes. Also, random effects might be crossed and nested. In terms of estimation, the classic linear model can be easily solved using the least-squares method. For the LMM, however, we need methods that predict the random effects *u* rather than estimate them as fixed coefficients, such as maximum likelihood (ML) and restricted maximum likelihood (REML). Bear in mind that unlike ML, REML assumes that the fixed effects are not known, hence it is comparatively unbiased (see Chapter 5 in Zuur *et al*. (2009) for more details). Unfortunately, LMMs too have underlying assumptions – **both residuals and random effects should be normally distributed**. Residuals in particular should also have a uniform variance over different values of the dependent variable, exactly as assumed in a classic linear model.

One of the most common doubts concerning LMMs is determining whether a variable is a random or a fixed effect. First of all, an effect might be fixed, random or even both simultaneously – it largely depends on how you approach a given problem. Generally, you should consider all factors that qualify as sampling from a population as random effects (*e.g.* individuals in repeated measurements, cities within countries, field trials, plots, blocks, batches) and everything else as fixed. As a rule of thumb, *i*) factors with fewer than 5 levels should be considered fixed and conversely *ii*) factors with numerous levels should be considered random effects in order to increase the accuracy in the estimation of variance. Both points relate to the LMM assumption of having normally distributed random effects.

Best linear unbiased estimators (BLUEs) and predictors (BLUPs) correspond to the values of fixed and random effects, respectively. The usage of the so-called genomic BLUPs (GBLUPs), for instance, elucidates the genetic merit of animal or plant genotypes that are regarded as random effects when trial conditions, *e.g.* location and year of trials are considered fixed. However, many studies sought the opposite, *i.e.* using breeding values as fixed effects and trial conditions as random, when the levels of the latter outnumber the former, chiefly because of point *ii*) outlined above. In GWAS, LMMs aid in teasing out population structure from the phenotypic measures.

We will follow a structure similar to the 10-step protocol outlined in Zuur *et al*. (2009): *i*) fit a full ordinary least squares model and run the diagnostics in order to understand if and what is faulty about its fit; *ii*) fit an identical generalized linear model (GLM) estimated with ML, to serve as a reference for subsequent LMMs; *iii*) deploy the first LMM by introducing random effects and compare to the GLM, then optimize the random structure in subsequent LMMs; *iv*) optimize the fixed structure by determining the significance of fixed effects, always using ML estimation; finally, *v*) use REML estimation on the optimal model and interpret the results.

You need to have `nlme` and `lme4` installed to proceed. We will firstly examine the structure of the Arabidopsis dataset.

```
# Install (if necessary) and load nlme and lme4
library(nlme)
library(lme4)

# Load dataset, inspect size and additional info
data(Arabidopsis)
dim(Arabidopsis) # 625 observations, 8 variables
?Arabidopsis
attach(Arabidopsis)
```

The Arabidopsis dataset describes 625 plants with respect to the following 8 variables (transcript from R):

- `reg`: region, a factor with 3 levels: `NL` (Netherlands), `SP` (Spain), `SW` (Sweden)
- `popu`: population, a factor with the form `n.R` representing a population in region `R`
- `gen`: genotype, a factor with 24 (numeric-valued) levels
- `rack`: a nuisance factor with 2 levels, one for each of two greenhouse racks
- `nutrient`: fertilization treatment/nutrient level (1, minimal nutrients or 8, added nutrients)
- `amd`: simulated herbivory or “clipping” (apical meristem damage), `unclipped` (baseline) or `clipped`
- `status`: a nuisance factor for germination method (`Normal`, `Petri.Plate`, or `Transplant`)
- `total.fruits`: total fruit set per plant (integer), henceforth TFPP for short.

We will now visualise the absolute frequencies in all 7 factors and the distribution for TFPP.

```
# Overview of the variables
par(mfrow = c(2,4))
barplot(table(reg), ylab = "Frequency", main = "Region")
barplot(table(popu), ylab = "Frequency", main = "Population")
barplot(table(gen), ylab = "Frequency", las = 2, main = "Genotype")
barplot(table(rack), ylab = "Frequency", main = "Rack")
barplot(table(nutrient), ylab = "Frequency", main = "Nutrient")
barplot(table(amd), ylab = "Frequency", main = "AMD")
barplot(table(status), ylab = "Frequency", main = "Status")
hist(total.fruits, col = "grey", main = "Total fruits", xlab = NULL)
```

The frequencies are overall balanced, perhaps except for `status` (*i.e.* germination method). Genotype, greenhouse rack and fertilizer are incorrectly interpreted as quantitative variables. In addition, the distribution of TFPP is right-skewed. As such, we will encode these three variables as categorical variables and log-transform TFPP to approximate a Gaussian distribution (natural logarithm).

```
# Transform the three factor variables gen, rack and nutrient
Arabidopsis[,c("gen","rack","nutrient")] <- lapply(Arabidopsis[,c("gen","rack","nutrient")], factor)
str(Arabidopsis)

# Re-attach after correction, ignore warnings
attach(Arabidopsis)

# Add 1 to total fruits, otherwise log of 0 will prompt error
total.fruits <- log(1 + total.fruits)
```

A closer look into the variables shows that each genotype is exclusive to a single region. The data contain no missing values.

```
# gen x popu table
table(gen, popu)

# Any NAs?
any(is.na(Arabidopsis)) # FALSE
```

At this point I hope you are familiar with the formula syntax in R. Note that interaction terms are denoted by `:` and fully crossed effects with `*`, so that `A*B = A + B + A:B`. The following code example

```
lm(y ~ x1 + x2*x3)
```

builds a linear model of *y* using `x1`, `x2`, `x3` and the interaction between `x2` and `x3`. In case you want to perform arithmetic operations inside the formula, use the function `I`. You can also introduce polynomial terms with the function `poly`. One handy trick I use to expand all pairwise interactions among predictors is

```
model.matrix(y ~ .*., data = X)
```

provided a data frame *X* that gathers all predictors and *y*. You can also simply use `.*.` inside the `lm` call, however you will likely need to preprocess the resulting interaction terms.
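These equivalences can be verified on a toy data frame (simulated columns, illustrative only):

```r
# Toy data to check the formula expansions described above
set.seed(42)
df <- data.frame(y = rnorm(10), A = rnorm(10), B = rnorm(10))

# A*B expands to A + B + A:B, so both models carry identical terms
m1 <- lm(y ~ A * B, data = df)
m2 <- lm(y ~ A + B + A:B, data = df)
print(names(coef(m1)))   # "(Intercept)" "A" "B" "A:B"

# .*. expands all pairwise interactions among the predictors
print(colnames(model.matrix(y ~ .*., data = df)))
```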

While the syntax of `lme` is identical to `lm` for fixed effects, its random effects are specified under the argument `random` as

```
random = ~intercept + fixed effect | random effect
```

and can be nested using `/`. In the following example

```
random = ~1 + C | A/B
```

the random effect B is nested within random effect A, altogether with random intercept and slope with respect to C. Therefore, not only will the groups defined by A and A/B have different intercepts, they will also be explained by slightly different shifts of the slope estimated for the fixed effect C.

Ideally, you should start with a full model (*i.e.* including all independent variables). Here, however, we cannot use all descriptors in the classic linear model since the fit will be singular due to the redundancy in the levels of `reg` and `popu`. For simplicity I will exclude these alongside `gen`, since it contains a lot of levels and also represents a random sample (from many other extant Arabidopsis genotypes). Additionally, I would rather use `rack` and `status` as random effects in the following models, but note that having only two and three levels respectively, it is advisable to keep them as fixed.

```
LM <- lm(total.fruits ~ rack + nutrient + amd + status)
summary(LM)
par(mfrow = c(2,2))
plot(LM)
```

These diagnostic plots show that the residuals of the classic linear model poorly qualify as normally distributed. Because we have no obvious outliers, the leverage analysis provides acceptable results. We will try to improve the distribution of the residuals using LMMs.

We need to build a GLM as a benchmark for the subsequent LMMs. This model can be fit without random effects, just like a `lm`, but employing ML or REML estimation, using the `gls` function. Hence, it can be used as a proper null model with respect to random effects. The GLM is also sufficient to tackle heterogeneous variance in the residuals by leveraging different types of variance and correlation functions, when no random effects are present (see arguments `correlation` and `weights`).

```
GLM <- gls(total.fruits ~ rack + nutrient + amd + status, method = "ML")
summary(GLM)
```

At this point you might consider comparing the GLM and the classic linear model and note they are identical. Also, you might wonder why we are using ML instead of REML – as hinted in the introduction, REML comparisons are meaningless in LMMs that differ in their fixed effects. Therefore, we will base all of our comparisons on ML and only use the REML estimation on the final, optimal model.
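The bias difference between ML and REML can be seen in the simplest possible case, an intercept-only model with simulated data (a sketch, not the Arabidopsis models): the ML variance estimate divides the residual sum of squares by *n*, while the REML estimate divides by *n* minus the number of fixed effects, exactly compensating for the fixed effects having been estimated.

```r
# Intercept-only model: ML vs REML-style variance estimates
set.seed(2)
n <- 20
y <- rnorm(n, mean = 5, sd = 2)
fit <- lm(y ~ 1)

ml_var   <- sum(resid(fit)^2) / n        # ML estimate: biased downwards
reml_var <- sum(resid(fit)^2) / (n - 1)  # REML estimate: unbiased here

print(c(ml = ml_var, reml = reml_var, var_y = var(y)))  # reml equals var(y)
```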

Let’s fit our first LMMs with all fixed effects used in the GLM, introducing `reg`, `popu`, `gen`, `reg/popu`, `reg/gen`, `popu/gen` and `reg/popu/gen` as random intercepts, separately.

In order to compare LMMs (and the GLM), we can use the function `anova` (note that it does not work for `lmer` objects) to compute the likelihood ratio test (LRT). This test will determine if the models are significantly different with respect to goodness-of-fit, as weighted by the trade-off between variance explained and degrees-of-freedom. The model fits are also evaluated based on the Akaike (AIC) and Bayesian (BIC) information criteria – the smaller their value, the better the fit.

```
lmm1 <- lme(total.fruits ~ rack + nutrient + amd + status, random = ~1|reg, method = "ML")
lmm2 <- lme(total.fruits ~ rack + nutrient + amd + status, random = ~1|popu, method = "ML")
lmm3 <- lme(total.fruits ~ rack + nutrient + amd + status, random = ~1|gen, method = "ML")
lmm4 <- lme(total.fruits ~ rack + nutrient + amd + status, random = ~1|reg/popu, method = "ML")
lmm5 <- lme(total.fruits ~ rack + nutrient + amd + status, random = ~1|reg/gen, method = "ML")
lmm6 <- lme(total.fruits ~ rack + nutrient + amd + status, random = ~1|popu/gen, method = "ML")
lmm7 <- lme(total.fruits ~ rack + nutrient + amd + status, random = ~1|reg/popu/gen, method = "ML")
anova(GLM, lmm1, lmm2, lmm3, lmm4, lmm5, lmm6, lmm7)
```
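The quantities that `anova` weighs can be reproduced by hand on any pair of nested ML fits; the sketch below uses the built-in `cars` data rather than the models above, purely to make the arithmetic transparent:

```r
# LRT and AIC by hand for two nested models
fit_small <- lm(dist ~ speed, data = cars)
fit_large <- lm(dist ~ speed + I(speed^2), data = cars)

# AIC = -2 * logLik + 2 * k, with k the number of estimated parameters
k <- attr(logLik(fit_small), "df")
aic_manual <- -2 * as.numeric(logLik(fit_small)) + 2 * k

# LRT statistic: twice the gain in log-likelihood, compared against a
# chi-squared distribution with df = difference in parameter counts
lrt <- 2 * (as.numeric(logLik(fit_large)) - as.numeric(logLik(fit_small)))
p_value <- pchisq(lrt, df = 1, lower.tail = FALSE)

print(c(AIC = aic_manual, LRT = lrt, p = p_value))
```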

We could now base our selection on the AIC, BIC or log-likelihood. Considering most models are indistinguishable with respect to goodness-of-fit, I will select `lmm6` and `lmm7` as the two best models so that we have more of a random structure to look at. Take a look into the distribution of the random effects with `plot(ranef(MODEL))`. We next proceed to incorporate random slopes.

There is the possibility that the different researchers from the different regions might have handled and fertilized plants differently, thereby exerting slightly different impacts. Let’s update `lmm6` and `lmm7` to include random slopes with respect to `nutrient`. We first need to set up a control setting that ensures the new models converge.

```
# Set optimization pars
ctrl <- lmeControl(opt = "optim")
lmm6.2 <- update(lmm6, .~., random = ~nutrient|popu/gen, control = ctrl)
lmm7.2 <- update(lmm7, .~., random = ~nutrient|reg/popu/gen, control = ctrl)
anova(lmm6, lmm6.2, lmm7, lmm7.2) # both models improved
anova(lmm6.2, lmm7.2) # similar fit; lmm6.2 more parsimonious
summary(lmm6.2)
```

Assuming a significance level of α = 0.05, the inclusion of random slopes with respect to `nutrient` improved both `lmm6` and `lmm7`. Comparing `lmm6.2` and `lmm7.2` head-to-head provides no evidence for differences in fit, so we select the simpler model, `lmm6.2`. Let’s check how the random intercepts and slopes distribute in the highest level (*i.e.* `gen` within `popu`).

```
plot(ranef(lmm6.2, level = 2))
```

The random intercepts (left) appear to be normally distributed, except for genotype `34`, biased towards negative values. This could warrant repeating the entire analysis without this genotype. The random slopes (right), on the other hand, are rather normally distributed. Interestingly, there is a negative correlation of -0.61 between random intercepts and slopes, suggesting that genotypes with low baseline TFPP tend to respond better to fertilization. Try `plot(ranef(lmm6.2, level = 1))` to observe the distributions at the level of `popu` only. Next, we will use QQ plots to compare the residual distributions between the GLM and `lmm6.2` to gauge the relevance of the random effects.

```
# QQ plots (drawn to the same scale!)
par(mfrow = c(1,2))
lims <- c(-3.5, 3.5)
qqnorm(resid(GLM, type = "normalized"), xlim = lims, ylim = lims, main = "GLM")
abline(0, 1, col = "red", lty = 2)
qqnorm(resid(lmm6.2, type = "normalized"), xlim = lims, ylim = lims, main = "lmm6.2")
abline(0, 1, col = "red", lty = 2)
```

The improvement is clear. Bear in mind these results do not change with REML estimation. Try different arrangements of random effects with nesting and random slopes, explore as much as possible!

Now that we are happy with the random structure, we will look into the summary of the optimal model so far (*i.e.* `lmm6.2`) and determine if we need to modify the fixed structure.

```
summary(lmm6.2)
```

All effects are significant with *p* < 0.05, except for one of the levels from `status` that represents transplanted plants. Given the significant effect from the other two levels, we will keep `status` and all current fixed effects. Just for fun, let’s add the interaction term `nutrient:amd` and see if there is any significant improvement in fit.

```
lmm8 <- update(lmm6.2, .~. + nutrient:amd)
summary(lmm8)
anova(lmm8, lmm6.2)
```

The addition of the interaction was non-significant at α = 0.05 and did not improve the goodness-of-fit, so we will drop it. Note that it is not a good idea to add new terms after optimizing the random structure; I did so only because otherwise there would be nothing to do with respect to the fixed structure.

We could play a lot more with different model structures, but to keep it simple let’s finalize the analysis by fitting the `lmm6.2` model using REML and finally identifying and understanding the differences in the main effects caused by the introduction of random effects.

```
finalModel <- update(lmm6.2, .~., method = "REML")
summary(finalModel)
```

We will now contrast our REML-fitted final model against a REML-fitted GLM and determine the impact of incorporating random intercepts and slopes, with respect to `nutrient`, at the level of `popu/gen`. Therefore, both will be given the same fixed effects and estimated using REML.

```
dev.off() # Reset previous graphical pars

# New GLM, updated from the first by estimating with REML
GLM2 <- update(GLM, .~., method = "REML")

# Plot side by side, beta with respective SEs
plot(coef(GLM2), xlab = "Fixed Effects", ylab = expression(beta), axes = F,
     pch = 16, col = "black", ylim = c(-.9, 2.2))
stdErrors <- coef(summary(GLM2))[,2]
segments(x0 = 1:6, x1 = 1:6, y0 = coef(GLM2) - stdErrors, y1 = coef(GLM2) + stdErrors,
         col = "black")
axis(2)
abline(h = 0, col = "grey", lty = 2)
axis(1, at = 1:6, labels = c("Intercept", "Rack", "Nutrient (Treated)", "AMD (Unclipped)",
                             "Status (PP)", "Status (Transplant)"), cex.axis = .7)

# LMM
points(1:6 + .1, fixef(finalModel), pch = 16, col = "red")
stdErrorsLMM <- coef(summary(finalModel))[,2]
segments(x0 = 1:6 + .1, x1 = 1:6 + .1, y0 = fixef(finalModel) - stdErrorsLMM,
         y1 = fixef(finalModel) + stdErrorsLMM, col = "red")

# Legend
legend("topright", legend = c("GLM","LMM"), text.col = c("black","red"), bty = "n")
```

The figure above depicts the estimated coefficients β for the different fixed effects, including the intercept, in the GLM (black) and the final LMM (red). Error bars represent the corresponding standard errors (SE). Overall the results are similar but uncover two important differences. First, for all fixed effects except the intercept and `nutrient`, the SE is smaller in the LMM. Second, the relative effects from the two levels of `status` are opposite. With the consideration of random effects, the LMM estimated a more negative effect of culturing in Petri plates on TFPP, and conversely a less negative effect of transplantation.

```
plot(finalModel)
```

The distribution of the residuals as a function of the predicted TFPP values in the LMM is still similar to the first panel in the diagnostic plots of the classic linear model. The usage of additional predictors and generalized additive models would likely improve it.

Now that we account for genotype-within-region random effects, how do we interpret the LMM results?

Plants that were placed in the first rack, left unfertilized, clipped and grown normally have an average TFPP of 2.15 on the log scale. This is the value of the estimated grand mean (*i.e.* intercept), and the predicted TFPP when all other factors and levels do not apply. For example, a plant grown under the same conditions but placed in the second rack is predicted to have a smaller yield, namely the intercept plus the negative `rack` coefficient reported in the model summary. To these reported yield values, we still need to add the random intercepts predicted for region and genotype within region (which are tiny values, by comparison; think of them as a small adjustment). Moreover, we can state that

- Fertilized plants produce more fruits than those kept unfertilized. This was the strongest main effect and represents a very sensible finding.
- Plants grown in the second rack produce fewer fruits than those in the first rack. This was the second strongest main effect identified. Could this be due to light / water availability?
- Simulated herbivory (AMD) negatively affects fruit yield. This is also a sensible finding – when plants are attacked, more energy is allocated to build up biochemical defence mechanisms against herbivores and pathogens, hence compromising growth and eventually fruit yield.
- Both culturing in Petri plates and transplantation, albeit indistinguishable, negatively affect fruit yield as opposed to normal growth. When conditions are radically changed, plants must adapt swiftly and this comes at a cost as well. Thus, these observations too make perfect sense.
- One important observation is that the genetic contribution to fruit yield, as gauged by `gen`, was normally distributed and adequately modelled as random. One single outlier could eventually be excluded from the analysis. This makes sense assuming plants have evolved universal mechanisms in response to both nutritional status and herbivory that overrule any minor genetic differences among individuals from the same species.
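Since TFPP was modelled as log(1 + TFPP), the fixed-effect estimates above are on the log scale; to express them as fruit counts they back-transform with exp(x) - 1. A quick check on the reported grand mean (bearing in mind this back-transforms the mean of the log values, not the arithmetic mean of TFPP):

```r
# Back-transform the estimated intercept from the log(1 + x) scale
intercept <- 2.15
baseline_tfpp <- exp(intercept) - 1
print(round(baseline_tfpp, 1)) # roughly 7.6 fruits for the baseline condition
```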

- Always check the residuals and the random effects! While both linear models and LMMs require normally distributed residuals with homogeneous variance, the former assumes independence among observations and the latter normally distributed random effects. Use normalized residuals to establish comparisons.
- One key additional advantage of LMMs we did not discuss is that they can handle missing values.
- Wide-format data should first be converted to long format, using *e.g.* the R package reshape.
- Variograms are very helpful in determining spatial or temporal dependence in the residuals. In the case of spatial dependence, bubble plots nicely represent residuals in the space the observations were drawn from (*e.g.* latitude and longitude; refer to Zuur *et al*. (2009) for more information).
- REML estimation is unbiased but does not allow for comparing models with different fixed structures. Only use the REML estimation on the optimal model.

With respect to this particular set of results:

- The analysis outlined here is not as exhaustive as it should be. Among other things, we did neither initially consider interaction terms among fixed effects nor investigate in sufficient depth the random effects from the optimal model.
- The dependent variable (total fruit set per plant) was highly right-skewed and required a log-transformation for basic modeling. The large amount of zeros would in rigour require zero inflated GLMs or similar approaches.
- All predictors used in the analysis were categorical factors. We could similarly use an ANOVA model. LMMs are likely more relevant in the presence of quantitative or mixed types of predictors.

I would like to thank Hans-Peter Piepho for answering my nagging questions over ResearchGate. I hope these superficial considerations were clear and insightful. I look forward to your suggestions and feedback. My next post will cover a joint multivariate model of multiple responses, the graph-guided fused LASSO (GFLASSO), using an R package I am currently developing. Happy holidays!
