(This article was first published on **Blog - Applied Predictive Modeling**, and kindly contributed to R-bloggers)

My slides from the R/Pharma conference talk on “Modeling in the Tidyverse” are available in PDF format as well as an HTML version.

(Joe Cheng just killed it in his Shiny presentation – see this repo)

To **leave a comment** for the author, please follow the link and comment on their blog: **Blog - Applied Predictive Modeling**.

R-bloggers.com offers

(This article was first published on **(en) The R Task Force**, and kindly contributed to R-bloggers)

**A blogpost inspired by a tweet and a YouTube video.**

Recently, I stumbled upon this tweet:

A demo of lightness perception pic.twitter.com/BSVpgcuIw1

— Akiyoshi Kitaoka (@AkiyoshiKitaoka) August 12, 2018

It's a demonstration of **how our perception of a color is affected by the lightness surrounding that color**.

Also recently, I've been watching the useR! 2018 YouTube recordings, which include a video called The Grammar of Animation, presenting `{gganimate}`, a package by Thomas Lin Pedersen that extends the `{ggplot2}` grammar of graphics to include animation.

If you want to know more about `{gganimate}`, feel free to browse the GitHub repo, and watch the YouTube video.

Back to our lightness perception: as you can see, the illusion is made by hand, with a piece of paper, moving the square manually. Let's recreate this with `{gganimate}`.

So, we'll need three things:

- A **variable stating all the transition states**
- A **variable for `y`** min and max, which will stay the same
- A **variable for `x`** min and max, which will be incremented by one on each transition state

```
d <- data.frame(
  # x coordinates
  x1 = 1:10, x2 = 2:11,
  # y coordinates
  y1 = 4, y2 = 5,
  # transition time
  t = 1:10
)
```

The background is basically a **gradient** from `#000000` to `#FFFFFF` (read this post for more about hex color notation). Let's create that object:

```
library(grid)
g <- rasterGrob(
  # Create the color gradient
  t(colorRampPalette(c("#000000", "#FFFFFF"))(1000)),
  # Scale it to fit the graph
  width = unit(1, "npc"), height = unit(1, "npc")
)
```

I'll first create the `ggplot` object, which is composed of 10 squares, all filled with **the same grey**: `"#7E7E7E"`. I use `theme_nothing()` from `{ggmap}` as an empty theme.

```
library(ggplot2)
gg <- ggplot() +
  annotation_custom(g, -Inf, Inf, -Inf, Inf) +
  geom_rect(
    data = d,
    mapping = aes(xmin = x1, xmax = x2, ymin = y1, ymax = y2),
    color = "black", fill = "#7E7E7E"
  ) +
  ylim(c(1, 8)) +
  ggmap::theme_nothing()
gg
```

Let's now **animate our plot to create the illusion**. As I want the move to be linear, I'll use a combination of `transition_time()` and `ease_aes('linear')` to make the transition smooth.

```
library(gganimate)
gg_animated <- gg +
transition_time(t) +
ease_aes('linear')
```

And tadaa!

`gg_animated`

In this animation, **the square appears to change color. But it doesn't**: the `fill` is always `"#7E7E7E"`.

“Every light is a shade, compared to the higher lights, till you come to the sun; and every shade is a light, compared to the deeper shades, till you come to the night.” —John Ruskin, 1879.

OK, let’s forget R to focus on what is happening here, and quickly talk about perception of luminance (the level of light coming to your eye) and color.

We are here **facing a phenomenon known as a “gradient illusion”**. The important idea behind this is that **every color we perceive is influenced by its surroundings**: in other words, we perceive the color as lighter on the left of our image, as it is contrasted with black. The further the square moves toward the white, the darker the grey appears.

How does it work? When a color comes to your eye, you perceive a certain amount of luminance. In other words, you are able to tell if something is dark or light or somewhere in the middle. Our ability to do so is called “lightness constancy”, but these gradient illusions show us one thing: this ability is not perfect, and the “luminance environment” in which the color appears influences how we see it.

So, as we've just seen, **perception of color is influenced by its surroundings**. When it comes to creating a dataviz, color scales are crucial – even more now that we know this. Let's imagine that for some weird reason we have created this plot:

```
ggplot() +
  annotation_custom(g, -Inf, Inf, -Inf, Inf) +
  geom_col(aes(x = 1, y = 3), fill = "#7E7E7E") +
  geom_col(aes(x = 8, y = 4), fill = "#7E7E7E")
```

Again, **the fill is the same**, but one bar seems darker than the other, which can **trick the reader into thinking the values in the two bars are not the same**. Something to consider if there is a chance your plot will be converted to black and white.

Let's say we are drawing a map. Maps are composed of regions, and can be colored following a specific scale. But **there's a chance that two regions with the very same value on that scale would be surrounded by two opposite colors**. For example, two `#7E7E7E` squares could be surrounded, one by `#515151`, the other by `#aeaeae`.

```
ggplot() +
geom_rect(aes(xmin = 1, xmax = 4, ymin = 1, ymax = 4), fill = "#515151") +
geom_rect(aes(xmin = 2, xmax = 3, ymin = 2, ymax = 3), fill = "#7E7E7E") +
geom_rect(aes(xmin = 4, xmax = 7, ymin = 1, ymax = 4), fill = "#aeaeae") +
geom_rect(aes(xmin = 5, xmax = 6, ymin = 2, ymax = 3), fill = "#7E7E7E")
```

What to do now?

- Now that you know this phenomenon, **pay attention to it** when you create plots
- Be **careful when choosing palettes**
- Try to **turn your graph to black and white** with `colorspace::desaturate`

```
with_palette <- function(palette) {
  x <- y <- seq(-8 * pi, 8 * pi, len = 40)
  r <- sqrt(outer(x^2, y^2, "+"))
  filled.contour(cos(r^2) * exp(-r / (2 * pi)),
    axes = FALSE,
    color.palette = palette,
    asp = 1
  )
}
with_palette(
  purrr::compose(
    colorspace::desaturate,
    viridis::viridis
  )
)
```

- “Lightness Perception and Lightness Illusions”, Edward H. Adelson, 2000
- “Lightness models, gradient illusions, and curl”, Lawrence E. Arend and Robert Goldstein, 1987

The post Remaking ‘Luminance-gradient-dependent lightness illusion’ with R appeared first on (en) The R Task Force.


(This article was first published on **Shravan Vasishth's Slog (Statistics blog)**, and kindly contributed to R-bloggers)

We wrote a short tutorial on contrast coding, covering the common contrast coding scenarios, among them: treatment, Helmert, ANOVA, sum, and sliding (successive-differences) contrasts. The target audience is psychologists and linguists, but really it is for anyone running planned experiments.

The paper has not been submitted anywhere yet. We are keen to get user feedback before we do that. Comments and criticism are very welcome. Please post comments on this blog, or email me.

Abstract:

Factorial experiments in research on memory, language, and in other areas are often analyzed using analysis of variance (ANOVA). However, for experimental factors with more than two levels, the ANOVA omnibus F-test is not informative about the source of a main effect or interaction. This is unfortunate as researchers typically have specific hypotheses about which condition means differ from each other. A priori contrasts (i.e., comparisons planned before the sample means are known) between specific conditions or combinations of conditions are the appropriate way to represent such hypotheses in the statistical model. Many researchers have pointed out that contrasts should be “tested instead of, rather than as a supplement to, the ordinary ‘omnibus’ F test” (Hayes, 1973, p. 601). In this tutorial, we explain the mathematics underlying different kinds of contrasts (i.e., treatment, sum, repeated, Helmert, and polynomial contrasts), discuss their properties, and demonstrate how they are applied in the R System for Statistical Computing (R Core Team, 2018). In this context, we explain the generalized inverse which is needed to compute the weight coefficients for contrasts that test hypotheses that are not covered by the default set of contrasts. A detailed understanding of contrast coding is crucial for successful and correct specification in linear models (including linear mixed models). Contrasts defined a priori yield far more precise confirmatory tests of experimental hypotheses than standard omnibus F-tests.

Full paper: https://arxiv.org/abs/1807.10451
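As a quick, hedged taste of the contrast types the tutorial covers (this is not code from the paper itself), base R and `MASS` ship with constructor functions for the standard coding schemes:

```r
# Base R constructors for common contrast coding schemes (k = 3 levels)
contr.treatment(3)  # each level vs. the first level (R's default)
contr.sum(3)        # each level vs. the grand mean (deviation / "ANOVA" coding)
contr.helmert(3)    # each level vs. the mean of the preceding levels
MASS::contr.sdif(3) # sliding (successive) differences: level 2 - 1, level 3 - 2

# Attach a non-default coding to a factor before fitting a model
f <- factor(c("low", "mid", "high"), levels = c("low", "mid", "high"))
contrasts(f) <- contr.sum(3)
```

Each function returns a k × (k − 1) matrix whose columns define the comparisons the fitted coefficients will represent.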


(This article was first published on **R-posts.com**, and kindly contributed to R-bloggers)

Imagine a situation where you are asked to predict the tourism revenue for a country, let’s say India. In this case, your output or dependent or response variable will be total revenue earned (in USD) in a given year. But, what about independent or predictor variables?

You have been provided with two sets of predictor variables and you have to choose one of the sets to predict your output. The first set consists of three variables:

- X1 = Total number of tourists visiting the country
- X2 = Government spending on tourism marketing
- X3 = a*X1 + b*X2 + c, where a, b and c are some constants

The second set also consists of three variables:

- X1 = Total number of tourists visiting the country
- X2 = Government spending on tourism marketing
- X3 = Average currency exchange rate

Which of the two sets do you think provides more information for predicting our output?

I am sure you will agree with me that the second set provides more information for predicting the output, because its three variables are different from each other and each provides different information (we can infer this intuitively at this moment). Moreover, none of the three variables is directly derived from the other variables in the system. Alternatively, we can say that none of the variables is a linear combination of the other variables in the system.

In the first set of variables, only two variables provide relevant information; the third is nothing but a linear combination of the other two. Had we developed a model without this third variable, the model would have captured the same combination through the coefficients it estimates for the first two.

Now, this effect in the first set of variables is called multicollinearity. Variables in the first set are strongly correlated to each other (if not all, at least some variables are correlated with other variables). Model developed using the first set of variables may not provide as accurate results as the second one because we are missing out on relevant variables/information in the first set. Therefore, it becomes important to study multicollinearity and the techniques to detect and tackle its effect in regression models.

According to Wikipedia, *“Collinearity is a linear association between two explanatory variables. Two variables are perfectly collinear if there is an exact linear relationship between them. For example, X₁ and X₂ are perfectly collinear if there exist parameters λ₀ and λ₁ such that, for all observations i, we have*

*X₂ᵢ = λ₀ + λ₁X₁ᵢ*

*Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related.”*

We saw an example of exactly what the Wikipedia definition is describing.

Perfect multicollinearity occurs when one independent variable is an exact linear combination of the other variables. For example, you already have X and Y as independent variables, and you add another variable, Z = a*X + b*Y, to the set. This new variable, Z, does not add any information beyond what X and Y already provide; the model simply folds this combination into the coefficients it estimates.
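A minimal simulation (hypothetical data, not from the article) makes this concrete: when `z` is an exact linear combination of `x` and `y`, R's `lm()` cannot estimate a separate coefficient for it and reports `NA`:

```r
set.seed(42)
x <- rnorm(50)
y <- rnorm(50)
z <- 2 * x + 3 * y           # z is an exact linear combination of x and y
out <- 1 + x + y + rnorm(50)

fit <- lm(out ~ x + y + z)
coef(fit)                    # the coefficient for z is NA: lm drops the redundant column
```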

Multicollinearity may arise from several factors. Inclusion or incorrect use of dummy variables in the system may lead to multicollinearity. The other reason could be the usage of derived variables, i.e., one variable is computed from other variables in the system. This is similar to the example we took at the beginning of the article. The other reason could be taking variables which are similar in nature or which provide similar information or the variables which have very high correlation among each other.

Multicollinearity may not pose a problem at the overall model level, but it strongly impacts the individual variables and their predictive power. You may not be able to identify which variables in your model are statistically significant. Moreover, you will be working with a set of variables that provide similar information, i.e., variables that are redundant with respect to one another.

- It becomes difficult to identify statistically significant variables. Since your model becomes very sensitive to the sample you choose to run it on, different samples may show different statistically significant variables.
- Because of multicollinearity, regression coefficients cannot be estimated precisely, as the standard errors tend to be very high. The value and even the sign of regression coefficients may change when different samples are chosen from the data.
- The model becomes very sensitive to the addition or deletion of any independent variable. If you add a variable which is orthogonal to the existing variables, your model may produce completely different results. Deletion of a variable may also significantly impact the overall results.
- Confidence intervals tend to become wider, because of which we may not be able to reject the null hypothesis (that the true population coefficient is zero).
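The standard-error inflation in the second bullet is easy to demonstrate with simulated data (a sketch, not the article's dataset): adding a nearly collinear predictor balloons the standard error of the original coefficient.

```r
set.seed(7)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # nearly collinear with x1
y  <- 1 + x1 + x2 + rnorm(n)

# Standard error of x1's coefficient, without and with the collinear x2
se_alone <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
se_both  <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
c(alone = se_alone, with_x2 = se_both)
```

The point estimate for `x1` is still unbiased; it is the precision that collapses.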

Now, moving on to how to detect the presence of multicollinearity in the system.

There are multiple ways to detect the presence of multicollinearity among the independent or explanatory variables.

- The first and most rudimentary way is to create a pair-wise correlation plot among the different variables. In most cases, variables will have some degree of correlation with each other, but a high correlation coefficient may be a point of concern for us. It may indicate the presence of multicollinearity among the variables.

- Large variations in regression coefficients on the addition or deletion of explanatory or independent variables can indicate the presence of multicollinearity. So can significant changes in the regression coefficients from sample to sample: with different samples, different statistically significant variables may emerge.
- Another method is to use tolerance or the variance inflation factor (VIF):

Tolerance = 1 – R², VIF = 1 / Tolerance = 1 / (1 – R²)

where R² is obtained by regressing that predictor on all the other predictors. A VIF of over 10 indicates that a variable is highly correlated with the others; usually, a VIF of less than 4 is considered good for a model.
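The definition above translates directly into a few lines of R. `vif_manual` below is an illustrative helper written for this post (packages such as `car` provide a `vif()` that computes the same quantity for fitted models):

```r
# VIF of each predictor: regress it on the remaining predictors
# and compute 1 / (1 - R^2)
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(df), v), v), data = df))$r.squared
    1 / (1 - r2)
  })
}

# Two nearly collinear predictors (x1, x2) and one independent predictor (x3)
set.seed(11)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)
x3 <- rnorm(100)
round(vif_manual(data.frame(x1, x2, x3)), 1)
```

On this simulated data, `x1` and `x2` come out with VIFs far above 10, while `x3` stays near 1.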

- The model may have a very high R-squared value while most of the coefficients are not statistically significant. This kind of scenario may reflect multicollinearity in the system.
- The Farrar-Glauber test is one of the statistical tests used to detect multicollinearity. It comprises three further tests. The first, a chi-square test, examines whether multicollinearity is present in the system. The second, an F-test, determines which regressors or explanatory variables are collinear. The third, a t-test, determines the type or pattern of multicollinearity.

We will now use some of these techniques and try their implementation in R.

We will use CPS_85_Wages data which consists of a random sample of 534 persons from the CPS (Current Population Survey). The data provides information on wages and other characteristics of the workers. (Link – http://lib.stat.cmu.edu/datasets/CPS_85_Wages). You can go through the data details on the link provided.

In this data, we will predict wages from other variables in the data.

```
> data1 = read.csv(file.choose(), header = T)
> head(data1)
  Education South Sex Experience Union  Wage Age Race Occupation Sector Marr
1         8     0   1         21     0  5.10  35    2          6      1    1
2         9     0   1         42     0  4.95  57    3          6      1    1
3        12     0   0          1     0  6.67  19    3          6      1    0
4        12     0   0          4     0  4.00  22    3          6      0    0
5        12     0   0         17     0  7.50  35    3          6      0    1
6        13     0   0          9     1 13.07  28    3          6      0    0
> str(data1)
'data.frame': 534 obs. of 11 variables:
 $ Education : int 8 9 12 12 12 13 10 12 16 12 ...
 $ South     : int 0 0 0 0 0 0 1 0 0 0 ...
 $ Sex       : int 1 1 0 0 0 0 0 0 0 0 ...
 $ Experience: int 21 42 1 4 17 9 27 9 11 9 ...
 $ Union     : int 0 0 0 0 0 1 0 0 0 0 ...
 $ Wage      : num 5.1 4.95 6.67 4 7.5 ...
 $ Age       : int 35 57 19 22 35 28 43 27 33 27 ...
 $ Race      : int 2 3 3 3 3 3 3 3 3 3 ...
 $ Occupation: int 6 6 6 6 6 6 6 6 6 6 ...
 $ Sector    : int 1 1 1 0 0 0 0 0 1 0 ...
 $ Marr      : int 1 1 0 0 1 0 0 0 1 0 ...
```

The above results show the sample view of data and the variables present in the data. Now, let’s fit the linear regression model and analyze the results.

```
> fit_model1 = lm(log(data1$Wage) ~ ., data = data1)
> summary(fit_model1)

Call:
lm(formula = log(data1$Wage) ~ ., data = data1)

Residuals:
     Min       1Q   Median       3Q      Max
-2.16246 -0.29163 -0.00469  0.29981  1.98248

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.078596   0.687514   1.569 0.117291
Education    0.179366   0.110756   1.619 0.105949
South       -0.102360   0.042823  -2.390 0.017187 *
Sex         -0.221997   0.039907  -5.563 4.24e-08 ***
Experience   0.095822   0.110799   0.865 0.387531
Union        0.200483   0.052475   3.821 0.000149 ***
Age         -0.085444   0.110730  -0.772 0.440671
Race         0.050406   0.028531   1.767 0.077865 .
Occupation  -0.007417   0.013109  -0.566 0.571761
Sector       0.091458   0.038736   2.361 0.018589 *
Marr         0.076611   0.041931   1.827 0.068259 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4398 on 523 degrees of freedom
Multiple R-squared:  0.3185, Adjusted R-squared:  0.3054
F-statistic: 24.44 on 10 and 523 DF,  p-value: < 2.2e-16
```

The linear regression results show that the model is statistically significant: the F-statistic is high and the model's p-value is less than 0.05. However, on closer examination we observe that four variables – Education, Experience, Age and Occupation – are not statistically significant, while two variables, Race and Marr (marital status), are significant only at the 10% level. Now, let's plot the model diagnostics to validate the assumptions of the model.

```
> plot(fit_model1)
```

The diagnostic plots also look fine. Let’s investigate further and look at pair-wise correlation among variables.

```
> library(corrplot)
> cor1 = cor(data1)
> corrplot.mixed(cor1, lower.col = "black", number.cex = .7)
```

The above correlation plot shows that there is high correlation between experience and age variables. This might be resulting in multicollinearity in the model.

Now, let's move a step further and try the Farrar-Glauber test to investigate this. The `mctest` package provides the Farrar-Glauber test in R.

```
> install.packages("mctest")
> library(mctest)
```

We will first use omcdiag function in mctest package. According to the package description, omcdiag (Overall Multicollinearity Diagnostics Measures) computes different overall measures of multicollinearity diagnostics for matrix of regressors.

```
> omcdiag(data1[, c(1:5, 7:11)], data1$Wage)

Call:
omcdiag(x = data1[, c(1:5, 7:11)], y = data1$Wage)

Overall Multicollinearity Diagnostics

                       MC Results detection
Determinant |X'X|:         0.0001         1
Farrar Chi-Square:      4833.5751         1
Red Indicator:             0.1983         0
Sum of Lambda Inverse: 10068.8439         1
Theil's Method:            1.2263         1
Condition Number:        739.7337         1

1 --> COLLINEARITY is detected by the test
0 --> COLLINEARITY is not detected by the test
```

The above output shows that multicollinearity is present in the model. Now, let's go a step further and check the F-test within the Farrar-Glauber test.

```
> imcdiag(data1[, c(1:5, 7:11)], data1$Wage)

Call:
imcdiag(x = data1[, c(1:5, 7:11)], y = data1$Wage)

All Individual Multicollinearity Diagnostics Result

                 VIF    TOL          Wi          Fi Leamer      CVIF Klein
Education   231.1956 0.0043  13402.4982  15106.5849 0.0658  236.4725     1
South         1.0468 0.9553      2.7264      3.0731 0.9774    1.0707     0
Sex           1.0916 0.9161      5.3351      6.0135 0.9571    1.1165     0
Experience 5184.0939 0.0002 301771.2445 340140.5368 0.0139 5302.4188     1
Union         1.1209 0.8922      7.0368      7.9315 0.9445    1.1464     0
Age        4645.6650 0.0002 270422.7164 304806.1391 0.0147 4751.7005     1
Race          1.0371 0.9642      2.1622      2.4372 0.9819    1.0608     0
Occupation    1.2982 0.7703     17.3637     19.5715 0.8777    1.3279     0
Sector        1.1987 0.8343     11.5670     13.0378 0.9134    1.2260     0
Marr          1.0961 0.9123      5.5969      6.3085 0.9551    1.1211     0

1 --> COLLINEARITY is detected by the test
0 --> COLLINEARITY is not detected by the test

Education , South , Experience , Age , Race , Occupation , Sector , Marr ,
coefficient(s) are non-significant may be due to multicollinearity

R-square of y on all x: 0.2805

* use method argument to check which regressors may be the reason of collinearity
```

The above output shows that Education, Experience and Age are involved in multicollinearity; the VIF values for these variables are very high. Finally, let's examine the pattern of multicollinearity by conducting t-tests on the partial correlation coefficients.

```
> library(ppcor)
> pcor(data1[, c(1:5, 7:11)], method = "pearson")
$estimate
             Education        South          Sex  Experience        Union         Age         Race   Occupation
Education   1.000000000 -0.031750193  0.051510483 -0.99756187 -0.007479144  0.99726160  0.017230877  0.029436911
South      -0.031750193  1.000000000 -0.030152499 -0.02231360 -0.097548621  0.02152507 -0.111197596  0.008430595
Sex         0.051510483 -0.030152499  1.000000000  0.05497703 -0.120087577 -0.05369785  0.020017315 -0.142750864
Experience -0.997561873 -0.022313605  0.054977034  1.00000000 -0.010244447  0.99987574  0.010888486  0.042058560
Union      -0.007479144 -0.097548621 -0.120087577 -0.01024445  1.000000000  0.01223890 -0.107706183  0.212996388
Age         0.997261601  0.021525073 -0.053697851  0.99987574  0.012238897  1.00000000 -0.010803310 -0.044140293
Race        0.017230877 -0.111197596  0.020017315  0.01088849 -0.107706183 -0.01080331  1.000000000  0.057539374
Occupation  0.029436911  0.008430595 -0.142750864  0.04205856  0.212996388 -0.04414029  0.057539374  1.000000000
Sector     -0.021253493 -0.021518760 -0.112146760 -0.01326166 -0.013531482  0.01456575  0.006412099  0.314746868
Marr       -0.040302967  0.030418218  0.004163264 -0.04097664  0.068918496  0.04509033  0.055645964 -0.018580965
                 Sector         Marr
Education  -0.021253493 -0.040302967
South      -0.021518760  0.030418218
Sex        -0.112146760  0.004163264
Experience -0.013261665 -0.040976643
Union      -0.013531482  0.068918496
Age         0.014565751  0.045090327
Race        0.006412099  0.055645964
Occupation  0.314746868 -0.018580965
Sector      1.000000000  0.036495494
Marr        0.036495494  1.000000000

$p.value
            Education      South         Sex Experience        Union       Age       Race   Occupation       Sector
Education   0.0000000 0.46745162 0.238259049  0.0000000 8.641246e-01 0.0000000 0.69337880 5.005235e-01 6.267278e-01
South       0.4674516 0.00000000 0.490162786  0.6096300 2.526916e-02 0.6223281 0.01070652 8.470400e-01 6.224302e-01
Sex         0.2382590 0.49016279 0.000000000  0.2080904 5.822656e-03 0.2188841 0.64692038 1.027137e-03 1.005138e-02
Experience  0.0000000 0.60962999 0.208090393  0.0000000 8.146741e-01 0.0000000 0.80325456 3.356824e-01 7.615531e-01
Union       0.8641246 0.02526916 0.005822656  0.8146741 0.000000e+00 0.7794483 0.01345383 8.220095e-07 7.568528e-01
Age         0.0000000 0.62232811 0.218884070  0.0000000 7.794483e-01 0.0000000 0.80476248 3.122902e-01 7.389200e-01
Race        0.6933788 0.01070652 0.646920379  0.8032546 1.345383e-02 0.8047625 0.00000000 1.876376e-01 8.833600e-01
Occupation  0.5005235 0.84704000 0.001027137  0.3356824 8.220095e-07 0.3122902 0.18763758 0.000000e+00 1.467261e-13
Sector      0.6267278 0.62243025 0.010051378  0.7615531 7.568528e-01 0.7389200 0.88336002 1.467261e-13 0.000000e+00
Marr        0.3562616 0.48634504 0.924111163  0.3482728 1.143954e-01 0.3019796 0.20260170 6.707116e-01 4.035489e-01
                Marr
Education  0.3562616
South      0.4863450
Sex        0.9241112
Experience 0.3482728
Union      0.1143954
Age        0.3019796
Race       0.2026017
Occupation 0.6707116
Sector     0.4035489
Marr       0.0000000

$statistic
              Education      South         Sex   Experience      Union          Age       Race Occupation     Sector
Education     0.0000000 -0.7271618  1.18069629 -327.2105031 -0.1712102  308.6803174  0.3944914  0.6741338 -0.4866246
South        -0.7271618  0.0000000 -0.69053623   -0.5109090 -2.2436907    0.4928456 -2.5613138  0.1929920 -0.4927010
Sex           1.1806963 -0.6905362  0.00000000    1.2603880 -2.7689685   -1.2309760  0.4583091 -3.3015287 -2.5834540
Experience -327.2105031 -0.5109090  1.26038801    0.0000000 -0.2345184 1451.9092015  0.2492636  0.9636171 -0.3036001
Union        -0.1712102 -2.2436907 -2.76896848   -0.2345184  0.0000000    0.2801822 -2.4799336  4.9902208 -0.3097781
Age         308.6803174  0.4928456 -1.23097601 1451.9092015  0.2801822    0.0000000 -0.2473135 -1.0114033  0.3334607
Race          0.3944914 -2.5613138  0.45830912    0.2492636 -2.4799336   -0.2473135  0.0000000  1.3193223  0.1467827
Occupation    0.6741338  0.1929920 -3.30152873    0.9636171  4.9902208   -1.0114033  1.3193223  0.0000000  7.5906763
Sector       -0.4866246 -0.4927010 -2.58345399   -0.3036001 -0.3097781    0.3334607  0.1467827  7.5906763  0.0000000
Marr         -0.9233273  0.6966272  0.09530228   -0.9387867  1.5813765    1.0332156  1.2757711 -0.4254112  0.8359769
                  Marr
Education  -0.92332727
South       0.69662719
Sex         0.09530228
Experience -0.93878671
Union       1.58137652
Age         1.03321563
Race        1.27577106
Occupation -0.42541117
Sector      0.83597695
Marr        0.00000000

$n
[1] 534

$gp
[1] 8

$method
[1] "pearson"
```

As we saw earlier in the correlation plot, the partial correlations between age and experience, age and education, and education and experience are statistically significant. Other pairs are statistically significant as well. Thus, the Farrar-Glauber test helps us identify the variables which are causing multicollinearity in the model.

There are multiple ways to overcome the problem of multicollinearity: you may use ridge regression, principal component regression, or partial least squares regression. The alternative is to drop the variables which are causing multicollinearity; for instance, variables with a VIF of more than 10. In our case, since age and experience are highly correlated, you may drop one of them and build the model again. Try rebuilding the model after removing Experience or Age and check whether you get better results. Share your experiences in the comments section below.
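A small simulation (hypothetical numbers that mimic the age/education/experience relationship, not the CPS data itself) shows how dropping one of a collinear pair brings the VIF back down:

```r
set.seed(1)
n <- 200
age        <- rnorm(n, 40, 10)
education  <- rnorm(n, 12, 2)
experience <- age - education - 6 + rnorm(n, sd = 0.5)  # near-exact relation
wage       <- 5 + 0.3 * education + 0.1 * experience + rnorm(n)

# VIF of age with experience in the model vs. after dropping experience
vif_age_full    <- 1 / (1 - summary(lm(age ~ education + experience))$r.squared)
vif_age_reduced <- 1 / (1 - summary(lm(age ~ education))$r.squared)
c(full = vif_age_full, reduced = vif_age_reduced)
```

With `experience` removed, the VIF of `age` falls from the hundreds to roughly 1.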

**Author Bio:**

This article was contributed by Perceptive Analytics. Jyothirmayee Thondamallu, Chaitanya Sagar and Saneesh Veetil contributed to this article.

Perceptive Analytics provides Data Analytics, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Its client roster includes Fortune 500 and NYSE listed companies in the USA and India.


(This article was first published on **R-exercises**, and kindly contributed to R-bloggers)

In this exercise, we will continue to solve problems from the last exercise about GLM here. Therefore, the exercise number will start at 9. Please make sure you read and follow the previous exercise before you continue practicing.

In the last exercise, we found that there was over-dispersion in the model, so we tried Quasi-Poisson regression along with step-wise variable selection algorithms. Please note that here we assume there is no influence from the background theory or knowledge behind the data. Obviously, there is no such thing in the real world, but we use this step as an exercise in general.
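As background for the exercises (a sketch on simulated counts, not the exercise data-set): with over-dispersed counts, the Pearson dispersion statistic of a Poisson fit sits well above 1, and a negative binomial model absorbs the extra variance.

```r
library(MASS)
set.seed(123)
x <- rnorm(300)
y <- rnbinom(300, size = 1.5, mu = exp(0.5 + 0.8 * x))  # over-dispersed counts

pois_fit <- glm(y ~ x, family = poisson)
# Pearson dispersion statistic; values well above 1 indicate over-dispersion
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) / pois_fit$df.residual
dispersion

nb_fit <- glm.nb(y ~ x)  # negative binomial models the extra variance via theta
```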

Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the data-set and required package before running the exercise.

**Exercise 9**

Load the “MASS” package to fit the negative binomial model. Fit the model, considering all the explanatory variables.

**Exercise 10**

Check the summary of the model.

**Exercise 11**

Set options in base R, considering missing values.

**Exercise 12**

The previous exercise gave insight that variables 1,3,4,6 or 1,4,6 produce the best model performance. Therefore, refit the model using those variables.

**Exercise 13**

Check the diagnostic plots and draw a conclusion on whether the model gives the best performance.


(This article was first published on **R – Insights of a PhD**, and kindly contributed to R-bloggers)

I just came across a nice little post on acquiring and visualizing geodata in R using the Max Planck Institute of Ornithology as an example. It’s by the rOpenSci guys. Some useful code in there by the look of it… Worth a look…


(This article was first published on **R – Win-Vector Blog**, and kindly contributed to R-bloggers)

We are pleased and excited to announce that we are working on a second edition of *Practical Data Science with R*!

Manning Publications has just announced the launching of the MEAP (Manning Early Access Program) for the second edition. The MEAP allows you to subscribe to drafts of chapters as they become available, and give us feedback before the book goes into print. Currently, drafts of the first three chapters are available.

If you’ve been contemplating buying the first edition, and haven’t yet, never fear! If you subscribe to the MEAP for the second edition, an eBook copy of the previous edition, *Practical Data Science with R (First Edition)*, is included at no additional cost.

In addition to the topics that we covered in the first edition, we plan to add: additional material on using the `vtreat` package for data preparation; a discussion of LIME for model explanation; and sections on modeling techniques that we didn't cover in the first edition, such as gradient boosting, regularized regression, and auto-encoders.

Please subscribe to our book; your support now will help us improve it. Please also forward this offer to your friends and colleagues (and ask them to subscribe and forward it as well).

Manning is sharing a 50% off promotion code active until August 23, 2018: **mlzumel3**.


(This article was first published on **Blog - Applied Predictive Modeling**, and kindly contributed to R-bloggers)

I’ll be giving a talk at the R/Medicine conference on Sept 7th in New Haven CT.

My talk is on modeling in the tidyverse, and there are some excellent speakers: Rob Tibshirani, Mike Lawrence, Jennifer Thompson, and a bunch of others will be there.

Take a look at the conference website for more details.
