Linear models are the bread and butter of statistics, but there is a lot more to it than taking a ruler and drawing a line through a couple of points.
Some time ago Rasmus Bååth published an insightful blog article about how such models could be described from a distribution centric point of view, instead of the classic error terms convention.
I think the distribution centric view makes generalised linear models (GLM) much easier to understand as well. That’s the purpose of this post.
Using ice cream sales data I will set out to illustrate different models, starting with traditional linear least square regression, moving on to a linear model, a log-transformed linear model and then on to generalised linear models, namely a Poisson (log) GLM and a Binomial (logistic) GLM. Additionally, I will run a simulation with each model.
Along the way I aim to reveal the intuition behind a GLM using Rasmus' distribution centric description. I hope to clarify why one would want to use different distributions and link functions.
Here is the example data set I will be using. It shows the number of ice creams sold at different temperatures.
icecream <- data.frame(
temp=c(11.9, 14.2, 15.2, 16.4, 17.2, 18.1,
18.5, 19.4, 22.1, 22.6, 23.4, 25.1),
units=c(185L, 215L, 332L, 325L, 408L, 421L,
406L, 412L, 522L, 445L, 544L, 614L)
)
I will be comparing the results from this model with others. Hence, I write a helper function to provide consistent graphical outputs:
basicPlot <- function(...){
plot(units ~ temp, data=icecream, bty="n", lwd=2,
main="Number of ice creams sold", col="#00526D",
xlab="Temperature (Celsius)",
ylab="Units sold", ...)
axis(side = 1, col="grey")
axis(side = 2, col="grey")
}
basicPlot()
As expected, more ice cream is sold at higher temperatures.
I would like to create a model that predicts the number of ice creams sold for different temperatures, even outside the range of the available data.
I am particularly interested in how my models will behave in the more extreme cases when it is freezing outside, say when the temperature drops to 0ºC and also what it would say for a very hot summer’s day at 35ºC.
My first approach is to take a ruler and draw a straight line through the points, such that it minimises the average distance between the points and the line. That’s basically a linear least square regression line:
lsq.mod <- lsfit(icecream$temp, icecream$units)
basicPlot()
abline(lsq.mod, col="orange", lwd=2)
legend(x="topleft", bty="n", lwd=c(2,2), lty=c(NA,1),
legend=c("observation", "linear least square"),
col=c("#00526D","orange"), pch=c(1,NA))
That’s easy and looks not unreasonable.
For the least square method I didn’t think about distributions at all. It is just common sense, or is it?
Let’s start with a probability distribution centric description of the data.
I believe the observation \(y_i\) was drawn from a Normal (aka Gaussian) distribution with a mean \(\mu_i\), depending on the temperature \(x_i\), and a constant variance \(\sigma^2\) across all temperatures.
On another day with the same temperature I might have sold a different quantity of ice cream, but over many days with the same temperature the number of ice creams sold would start to fall into a range defined by \(\mu_i \pm \sigma\).
Thus, using a distribution centric notation my model looks like this:
\[
y_i \sim \mathcal{N}(\mu_i, \sigma^2), \\
\mathbb{E}[y_i] = \mu_i = \alpha + \beta x_i \; \text{for all} \; i
\]
Note, going forward I will drop the \(\text{for all} \; i\) statement for brevity.
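To make this distribution centric notation concrete, here is a minimal sketch of the generative process in R. The values of alpha, beta and sigma below are made up for illustration only; they are not fitted values.
# Illustrative sketch of the generative view (alpha, beta, sigma are made up)
alpha <- -100; beta <- 25; sigma <- 40
mu <- alpha + beta * icecream$temp          # expected sales at each temperature
y.sim <- rnorm(nrow(icecream), mean = mu, sd = sigma)
round(y.sim)                                # one hypothetical day of sales per temperature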
Or, alternatively, I might argue that the residuals, the difference between the observations and predicted values, follow a Gaussian distribution with a mean of 0 and variance \(\sigma^2\):
\[
\varepsilon_i = y_i - \mu_i \sim \mathcal{N}(0, \sigma^2)
\]
Furthermore, the equation
\[
\mathbb{E}[y_i] = \mu_i = \alpha + \beta x_i
\]
states my belief that the expected value of \(y_i\) is identical to the parameter \(\mu_i\) of the underlying distribution, while the variance is constant.
The same model written in the classic error terms convention would be:
\[
y_i = \alpha + \beta x_i + \varepsilon_i, \text{ with }
\varepsilon_i \sim \mathcal{N}(0, \sigma^2)
\]
I think the probability distribution centric convention makes it clearer that my observation is just one realisation from a distribution. It also emphasises that the parameter of the distribution is modelled linearly.
To model this in R explicitly I use the glm function, in which I specify the "response distribution" (namely the distribution of the number of ice creams sold) as Gaussian and the link function, which maps the expected value of that distribution onto the linear predictor (here a linear function of temperature), as the identity. Indeed, this really is the trick with a GLM: it describes how the expected value of the observations, often after a smooth transformation, relates to the actual measurements (predictors) in a linear way. The 'link' is that transformation of the expected value; its inverse maps the linear predictor back onto the scale of the data.
That's what it says on the GLM tin and that's all there is to it!
(Actually, there is also the bit, which estimates the parameters from the data via maximum likelihood, but I will skip this here.)
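As a quick sanity check (not part of the original analysis): a Gaussian GLM with an identity link is just the ordinary linear model, so lm() should reproduce the same coefficients as the glm() fit below.
# A Gaussian GLM with identity link equals ordinary least squares regression
ols.mod <- lm(units ~ temp, data = icecream)
coef(ols.mod)   # should match the coefficients of lin.mod below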
lin.mod <- glm(units ~ temp, data=icecream,
family=gaussian(link="identity"))
library(arm) # for 'display' function only
display(lin.mod)
## glm(formula = units ~ temp, family = gaussian(link = "identity"),
## data = icecream)
## coef.est coef.se
## (Intercept) -159.47 54.64
## temp 30.09 2.87
## ---
## n = 12, k = 2
## residual deviance = 14536.3, null deviance = 174754.9 (difference = 160218.6)
## overdispersion parameter = 1453.6
## residual sd is sqrt(overdispersion) = 38.13
Thus, to mimic my data I could generate random numbers from the following Normal distribution:
\[
y_i \sim \mathcal{N}(\mu_i, \sigma^2) \text{ with }
\mu_i = -159.5 + 30.1 \; x_i \text{ and }
\sigma = 38.1
\]
Although the linear model looks fine in the range of temperatures observed, it doesn't make much sense at 0ºC. The intercept is at -159, which would mean that customers return on average 159 units of ice cream on a freezing day. Well, I don't think so.
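To see the problem in the generative terms used above, here is a quick draw from that fitted Normal distribution at a few low temperatures, using the rounded coefficients; negative "sales" are entirely possible.
# Draws from the fitted Normal model at low temperatures (rounded coefficients);
# negative values illustrate why the model breaks down near 0 Celsius
set.seed(1)
rnorm(5, mean = -159.5 + 30.1 * c(0, 2, 4, 6, 8), sd = 38.1)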
Ok, perhaps I can transform the data first. Ideally I would like to ensure that the transformed data has only positive values. The first transformation that comes to my mind in those cases is the logarithm.
So, let’s model the ice cream sales on a logarithmic scale. Thus my model changes to:
\[
\log(y_i) \sim \mathcal{N}(\mu_i, \sigma^2) \\
\mathbb{E}[\log(y_i)] = \mu_i = \alpha + \beta x_i
\]
This model implies that I believe the sales follow a log-normal distribution:
\[
y_i \sim \log\mathcal{N}(\mu_i, \sigma^2)
\]
Note, the log-normal distribution is skewed to the right, which means that I regard higher sales figures as more likely than lower sales figures.
Although the model is still linear on a log-scale, I have to remember to transform the predictions back to the original scale because \(\mathbb{E}[\log(y_i)] \neq \log(\mathbb{E}[y_i])\). This is shown below. For a discussion of the transformation of the lognormal distribution see for example the help page of the R function rlnorm.
\[
y_i \sim \log\mathcal{N}(\mu_i, \sigma^2) \\
\mathbb{E}[y_i] = \exp(\mu_i + \sigma^2/2)
\]
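A quick numerical check of this back-transformation, with illustrative values for mu and sigma (these are not the fitted parameters), shows that the simulated lognormal mean agrees with exp(mu + sigma^2/2):
# Check E[y] = exp(mu + sigma^2/2) by simulation (mu and sigma are illustrative)
mu <- 5; sigma <- 0.14
mean(rlnorm(1e6, meanlog = mu, sdlog = sigma))   # simulated mean
exp(mu + sigma^2/2)                              # theoretical mean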
log.lin.mod <- glm(log(units) ~ temp, data=icecream,
family=gaussian(link="identity"))
display(log.lin.mod)
## glm(formula = log(units) ~ temp, family = gaussian(link = "identity"),
## data = icecream)
## coef.est coef.se
## (Intercept) 4.40 0.20
## temp 0.08 0.01
## ---
## n = 12, k = 2
## residual deviance = 0.2, null deviance = 1.4 (difference = 1.2)
## overdispersion parameter = 0.0
## residual sd is sqrt(overdispersion) = 0.14
log.lin.sig <- summary(log.lin.mod)$dispersion
log.lin.pred <- exp(predict(log.lin.mod) + 0.5 * log.lin.sig)
basicPlot()
lines(icecream$temp, log.lin.pred, col="red", lwd=2)
legend(x="topleft", bty="n", lwd=c(2,2), lty=c(NA,1),
legend=c("observation", "log-transformed LM"),
col=c("#00526D","red"), pch=c(1,NA))
This plot looks a little better than the previous linear model and it predicts that I would sell, on average, 82 ice creams when the temperature is 0ºC:
exp(coef(log.lin.mod)[1])
## (Intercept)
## 81.62131
Although this model makes a little more sense, it appears that it predicts too many sales at the low and high end of the observed temperature range. Furthermore, there is another problem with both this model and the previous linear model: the assumed model distributions generate real numbers, but ice cream sales only occur in whole numbers. As a result, any draw from the model distribution should be a whole number.
The classic approach for count data is the Poisson distribution.
The Poisson distribution has only one parameter, here \(\mu_i\), which is also its expected value. The canonical link function for \(\mu_i\) is the logarithm, i.e. the logarithm of the expected value is regarded as a linear function of the predictors. This means I have to apply the exponential function to the linear model to get back to the original scale. This is distinctly different from the log-transformed linear model above, where the original data was transformed, not the expected value of the data. The log function here is the link function, as it links the transformed expected value to the linear model.
Here is my model:
\[
y_i \sim \text{Poisson}(\mu_i) \\
\mathbb{E}[y_i] = \mu_i = \exp(\alpha + \beta x_i) = \exp(\alpha) \exp(\beta x_i) \\
\log(\mu_i) = \alpha + \beta x_i
\]
Let me say this again: although the expected value of my observation is a real number, the Poisson distribution will generate only whole numbers, in line with the actual sales.
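A tiny illustration of that point: the rate below is an arbitrary real number, yet every draw is a whole number.
# rpois() always returns whole numbers, even for a non-integer rate
set.seed(2)
rpois(8, lambda = 420.5)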
pois.mod <- glm(units ~ temp, data=icecream,
family=poisson(link="log"))
display(pois.mod)
## glm(formula = units ~ temp, family = poisson(link = "log"), data = icecream)
## coef.est coef.se
## (Intercept) 4.54 0.08
## temp 0.08 0.00
## ---
## n = 12, k = 2
## residual deviance = 60.0, null deviance = 460.1 (difference = 400.1)
pois.pred <- predict(pois.mod, type="response")
basicPlot()
lines(icecream$temp, pois.pred, col="blue", lwd=2)
legend(x="topleft", bty="n", lwd=c(2,2), lty=c(NA,1),
legend=c("observation", "Poisson (log) GLM"),
col=c("#00526D","blue"), pch=c(1,NA))
This looks pretty good. The interpretation of the coefficients should be clear now. I have to use the exp function to make a prediction of sales for a given temperature. However, R will do this for me automatically if I set type="response" in the predict statement above.
From the coefficients I can read off that at 0ºC I expect to sell \(\exp(4.54) \approx 94\) ice creams and that with each one degree increase in temperature the sales are predicted to increase by \(\exp(0.076) - 1 \approx 7.9\%\).
Note, the exponential function turns the additive scale into a multiplicative one.
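The figures quoted above can be read straight off the fitted model; a small illustration, using pois.mod as fitted above:
# Coefficient interpretation on the original scale
exp(coef(pois.mod)[1])       # expected sales at 0 Celsius, roughly 94
exp(coef(pois.mod)[2]) - 1   # multiplicative increase per degree, roughly 8%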
So far, so good. My model is in line with my observations. Additionally, it will not predict negative sales, and if I simulate from a Poisson distribution with a mean given by the above model I will only ever get whole numbers back.
However, my model will also predict that I should expect to sell over 1000 ice creams if the temperature reaches 32ºC:
predict(pois.mod, newdata=data.frame(temp=32), type="response")
## 1
## 1056.651
Perhaps the exponential growth in my model looks a little too good to be true. Indeed, I am pretty sure that my market will be saturated at around 800. Warning: this is a modelling assumption from my side!
Ok, let me think about the problem this way: if I have 800 potential sales then I'd like to understand the proportion sold at a given temperature.
This suggests a binomial distribution for the number of successful sales out of 800. The key parameter for the binomial distribution is the probability of success, the probability that someone will buy ice cream as a function of temperature.
Dividing my sales statistics by 800 would give me a first proxy for the probability of selling ice cream to a potential customer.
Therefore I need an S-shaped curve that maps the linear function of the temperature into probabilities between 0 and 100%.
A canonical choice is the logistic function:
\[
\text{logistic}(u) = \frac{e^u}{e^u + 1} = \frac{1}{1 + e^{-u}}
\]
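In R the logistic function and its inverse, the logit, are available as plogis and qlogis. A quick check that they undo each other:
# plogis() is the logistic function, qlogis() is the logit (its inverse)
plogis(0)             # 0.5
qlogis(plogis(1.5))   # 1.5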
With that my model can be described as:
\[
y_i \sim \text{Binom}(n, \mu_i) \\
\mathbb{E}\left[y_i/n\right] = \mu_i = \text{logit}^{-1}(\alpha + \beta x_i) \\
\text{logit}(\mu_i) = \alpha + \beta x_i
\]
market.size <- 800
icecream$opportunity <- market.size - icecream$units
bin.glm <- glm(cbind(units, opportunity) ~ temp, data=icecream,
family=binomial(link = "logit"))
display(bin.glm)
## glm(formula = cbind(units, opportunity) ~ temp, family = binomial(link = "logit"),
## data = icecream)
## coef.est coef.se
## (Intercept) -2.97 0.11
## temp 0.16 0.01
## ---
## n = 12, k = 2
## residual deviance = 84.4, null deviance = 909.4 (difference = 825.0)
bin.pred <- predict(bin.glm, type="response")*market.size
basicPlot()
lines(icecream$temp, bin.pred, col="purple", lwd=2)
legend(x="topleft", bty="n", lwd=c(2,2), lty=c(NA,1),
legend=c("observation", "Binomial (logit) GLM"),
col=c("#00526D","purple"), pch=c(1,NA))
The chart doesn’t look too bad at all!
As the temperature increases higher and higher this model will predict that sales will reach market saturation, while all the other models so far would predict higher and higher sales.
I can predict sales at 0ºC and 35ºC using the logistic function (the inverse of the logit), which is given in R as plogis:
# Sales at 0 Celsius
plogis(coef(bin.glm)[1])*market.size
## (Intercept)
## 39.09618
# Sales at 35 Celsius
plogis(coef(bin.glm)[1] + coef(bin.glm)[2]*35)*market.size
## (Intercept)
## 745.7449
So, that is 39 ice creams at 0ºC and 746 ice creams at 35ºC. Yet, these results will of course change if I change my assumptions on the market size. A market size of 1,000 would suggest that I can sell 55 units at 0ºC and 846 at 35ºC.
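Here is a sketch of that sensitivity check, assuming the model is simply refitted with the larger market size; the exact figures depend on the refit, but they should be close to the 55 and 846 units quoted above.
# Refit with an assumed market size of 1,000 and predict at 0 and 35 Celsius
market.size2 <- 1000
bin.glm2 <- glm(cbind(units, market.size2 - units) ~ temp,
                data = icecream, family = binomial(link = "logit"))
plogis(coef(bin.glm2)[1]) * market.size2                            # sales at 0 Celsius
plogis(coef(bin.glm2)[1] + coef(bin.glm2)[2] * 35) * market.size2   # sales at 35 Celsius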
Let’s bring all the models together into one graph, with temperatures ranging from 0 to 35ºC.
temp <- 0:35
p.lm <- predict(lin.mod, data.frame(temp=temp), type="response")
p.log.lm <- exp(predict(log.lin.mod, data.frame(temp=0:35), type="response") +
0.5 * summary(log.lin.mod)$dispersion)
p.pois <- predict(pois.mod, data.frame(temp=temp), type="response")
p.bin <- predict(bin.glm, data.frame(temp=temp), type="response")*market.size
basicPlot(xlim=range(temp), ylim=c(-20,market.size))
lines(temp, p.lm, type="l", col="orange", lwd=2)
lines(temp, p.log.lm, type="l", col="red", lwd=2)
lines(temp, p.pois, type="l", col="blue", lwd=2)
lines(temp, p.bin, type="l", col="purple", lwd=2)
legend(x="topleft",
legend=c("observation",
"linear model",
"log-transformed LM",
"Poisson (log) GLM",
"Binomial (logit) GLM"),
col=c("#00526D","orange", "red",
"blue", "purple"),
bty="n", lwd=rep(2,5),
lty=c(NA,rep(1,4)),
pch=c(1,rep(NA,4)))
The chart shows the predictions of my four models over a temperature range from 0 to 35ºC. Although the linear model looks OK between 10 and perhaps 30ºC, it clearly shows its limitations. The log-transformed linear and Poisson models appear to give similar predictions, but will predict an ever accelerating increase in sales as the temperature increases. I don't believe this makes sense, as even the most ice cream loving person can only eat so much ice cream on a really hot day. And that's why I would go with the Binomial model to predict my ice cream sales.
Having used the distribution centric view to describe my models leads naturally to simulations. If the models are good, then I shouldn’t be able to identify the simulation from the real data.
In all my models the linear structure was
\[
g(\mu_i) = \alpha + \beta x_i
\]
or in matrix notation
\[
g(\mu) = A v
\]
with \(A_{i,\cdot} = [1, x_i]\) and \(v = [\alpha, \beta]\), whereby \(A\) is the model matrix, \(v\) the vector of coefficients and \(g\) the link function.
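To make the matrix notation concrete, here is a small check (not in the original post) that applying the inverse link to A %*% v reproduces the fitted values, shown for the Poisson model; A is the same model matrix used in the simulation code below.
# The inverse link applied to the linear predictor A %*% v gives the fitted values
A <- model.matrix(units ~ temp, data = icecream)
all.equal(as.vector(exp(A %*% coef(pois.mod))),
          as.vector(predict(pois.mod, type = "response")))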
With that being said, let’s simulate data from each distribution for the temperatures measured in the original data and compare against the actual sales units.
n <- nrow(icecream)
A <- model.matrix(units ~ temp, data=icecream)
set.seed(1234)
(rand.normal <- rnorm(n,
mean=A %*% coef(lin.mod),
sd=sqrt(summary(lin.mod)$dispersion)))
## [1] 152.5502 278.3509 339.2073 244.5335 374.3981 404.4103 375.2385
## [8] 403.3892 483.9470 486.5775 526.3881 557.6662
(rand.logtrans <- rlnorm(n,
meanlog=A %*% coef(log.lin.mod),
sdlog=sqrt(summary(log.lin.mod)$dispersion)))
## [1] 196.0646 266.1002 326.8102 311.5621 315.0249 321.1920 335.3733
## [8] 564.6961 515.9348 493.4822 530.8356 691.2354
(rand.pois <- rpois(n,
lambda=exp(A %*% coef(pois.mod))))
## [1] 220 266 279 337 364 434 349 432 517 520 516 663
(rand.bin <- rbinom(n,
size=market.size,
prob=plogis(A %*% coef(bin.glm))))
## [1] 197 249 302 313 337 386 387 414 516 538 541 594
basicPlot(ylim=c(100,700))
cols <- adjustcolor(c("orange", "red", "blue", "purple"),
alpha.f = 0.75)
points(icecream$temp, rand.normal, pch=19, col=cols[1])
points(icecream$temp, rand.logtrans, pch=19, col=cols[2])
points(icecream$temp, rand.pois, pch=19, col=cols[3])
points(icecream$temp, rand.bin, pch=19, col=cols[4])
legend(x="topleft",
legend=c("observation",
"linear model",
"log-transformed LM",
"Poisson (log) GLM",
"Binomial (logit) GLM"),
col=c("#00526D",cols), lty=NA,
bty="n", lwd=rep(2,5),
pch=c(1,rep(19,4)))
The chart displays one simulation per observation from each model, but I believe it already shows some interesting aspects. Not only do I see that the Poisson and Binomial models generate whole numbers, while the Normal and log-transformed Normal models predict real numbers, but I also notice the skewness of the log-normal distribution in the red point at 19.4ºC.
Furthermore, the linear model predicts equally likely figures above and below the mean, and at 16.4ºC the prediction appears a little low – perhaps as a result.
Additionally, the high sales predictions at 25.1ºC of the log-transformed Normal and Poisson models shouldn't come as a surprise either.
Again, the simulation of the binomial model seems the most realistic to me.
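One natural extension, not part of the original analysis, would be to repeat the simulation many times rather than drawing a single value per observation; a sketch for the binomial model:
# Repeat the binomial simulation 1,000 times and summarise the spread
set.seed(1234)
sim.bin <- replicate(1000,
  rbinom(n, size = market.size, prob = plogis(A %*% coef(bin.glm))))
apply(sim.bin, 1, quantile, probs = c(0.05, 0.95))  # 90% interval per temperature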
I hope this little article illustrated the intuition behind generalised linear models. Fitting a model to data requires more than just applying an algorithm. In particular it is worth thinking about:
There are many aspects of GLMs which I haven’t touched on here, such as:
You can find the code for this post on GitHub as well.
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] arm_1.8-6 lme4_1.1-8 Matrix_1.2-1 MASS_7.3-42
loaded via a namespace (and not attached):
[1] minqa_1.2.4 tools_3.2.1 coda_0.17-1 abind_1.4-3
[5] Rcpp_0.11.6 splines_3.2.1 nlme_3.1-121 grid_3.2.1
[9] nloptr_1.0.4 lattice_0.20-31
MANHATTAN IS MORE EXTREME THAN YOU THINK IN POPULATION DENSITY
Slinging numbers around all day, we've come to believe the adage that most surprising statistics are wrong.
But here’s one that’s not: When you look at the 100 most populous counties in the USA, Manhattan (aka New York County) has about twice the population density of the next densest county (Brooklyn, aka Kings County), four times the density of the 5th densest county (San Francisco), and 13 times the density of the 10th densest county (Cook County, IL, home of Chicago).
Population density drops off sharply as you look at highly populated US counties, and New York City has 4 of the top 5. We color code by state in this graph:
The graph at the top of this post represents a square kilometer and draws a dot for every person in various counties. This representation is deceptive at high densities. It would look like a black square long before it got to 1,000,000 people (1,000 people by 1,000 people, each taking up a square meter). We just can’t show 1,000 by 1,000 dots on a graph that size.
We can be more faithful, and make things easier to imagine, if we talk about people per hectare.
But what’s a hectare? Glad you asked. It’s 100 meters by 100 meters. As we see below, it’s roughly one (US) football field by one football field:
Now the differences in densities are still dramatic, but it doesn’t look like people in Manhattan are packed in like sardines.
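To make the conversion concrete, here is a small R sketch (the density figure is purely illustrative, not a quoted statistic): a hectare is one hundredth of a square kilometre, so a density per square kilometre simply divides by 100.
# Hypothetical density, for illustration only
density_per_km2 <- 27000
density_per_hectare <- density_per_km2 / 100   # 1 km^2 = 100 hectares
density_per_hectare                            # 270 people per hectare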
CODE FOR R, GGPLOT, AND ANIMATION FANS
The post Visualizing population density appeared first on Decision Science News.
The Joint Statistics Meetings starting August 8 is the biggest meetup for statisticians in the world. Navigating the sheer quantity of interesting talks is challenging – there can be up to 50 sessions going on at a time!
To prepare for Seattle, we asked RStudio’s Chief Data Scientist Hadley Wickham for his top session picks. Here are 9 talks, admittedly biased towards R, graphics, and education, that really interested him and might interest you, too.
Undergraduate Curriculum: The Pathway to Sustainable Growth in Our Discipline
Sunday, 1400-1550, CC-607
Statistics with Computing in the Evolving Undergraduate Curriculum
Sunday, 1600-1750
In these back to back sessions, learn how statisticians are rising to the challenge of teaching computing and big(ger) data.
Recent Advances in Interactive Graphics for Data Analysis
Monday, 1030-1220, CC-608
This is an exciting session discussing innovations in interactive visualisation, and it’s telling that all of them connect with R. Hadley will be speaking about ggvis in this session.
Preparing Students to Work in Industry
Monday, 1400-1550, CC-4C4
If you’re a student about to graduate, we bet there will be some interesting discussion for you here.
Stat computing and graphics mixer
Monday, 1800-2000, S-Ravenna
This is advertised as a business meeting, but don’t be confused. It’s the premier social event for anyone interested in computing or visualisation!
Statistical Computing and Graphics Student Paper Competition
Tuesday, 0830-1020, CC-308
Hear this year’s winners of the student paper award talk about visualising phylogenetic data, multiple change point detection, capture-recapture data and teaching intro stat with R. All four talks come with accompanying R packages!
The Statistics Identity Crisis: Are We Really Data Scientists?
Tuesday, 0830-1020, CC-609
This session, organised by Jeffrey Leek, features an all-star cast of Alyssa Frazee, Chris Volinsky, Lance Waller, and Jenny Bryan.
Doing Good with Data Viz
Wednesday, 0830-1020, CC-2B
Hear Jake Porway, Patrick Ball, and Dino Citraro talk about using data to do good. Of all the sessions you shouldn't miss, this is the one you really shouldn't miss. (But unfortunately it conflicts with another great session – you have difficult choices ahead.)
Statistics at Scale: Applications from Tech Companies
Wednesday, 0830-1020, CC-204
Hilary Parker has organised a fantastic session where you’ll learn how companies like Etsy, Microsoft, and Facebook do statistics at scale. Get there early because this session is going to be PACKED!
There are hundreds of sessions at the JSM, so no doubt we’ve missed other great ones. If you think we’ve missed a “don’t miss” session, please add it to the comments so others can find it.
In between session times you’ll find the RStudio team hanging out at booth #435 in the expo center (at the far back on the right). Please stop by and say hi! We’ll have stickers to show your R pride and printed copies of many of our cheatsheets.
See you in Seattle!
This is the bimonthly post (for 2015-08-04) for new R Jobs from R-users.com.
Employers: visit this link to post a new R job to the R community (it’s free and quick).
Job seekers: please follow the links below to learn more and apply for your job of interest (or visit previous R jobs posts).
Unlike many of the entries on Wikipedia relating to statistics or computer science, fashion-related topics have not been thoroughly documented. For example, the entries on Martin Margiela and Rei Kawakubo pale in comparison to the breadth of content on John Bayes, structural equation modeling, or R. In light of this, I wanted to investigate whether people were using particular fashion-related entries on Wikipedia and see how usage patterns had evolved over time. My focus was on the four major fashion weeks given that they are central events within the industry and are paid attention to by tens of millions of people. This analysis is ultimately exploratory and we're unable to make any inferences about whether an adequate number of people are using the fashion week entries on Wikipedia or if that's the result of them not being thoroughly documented. At the end of the day, millions of people use Wikipedia and there's no doubt that the fashion community needs to be more progressive in ensuring that the fashion-related entries on the site are covered in a more cohesive manner.
Unsurprisingly, there are two spikes each year in and around the months where the Fall/Winter and Spring/Summer collections are shown. However, the spikes since 2013 have been less pronounced and there has been a gradual downward trend in visits. This is surprising given the increasing interest in fashion that has occurred over the past five years. This same downward trend also exists in Google search volume for fashion-related queries over the past few years.
The line graphs showing visits to the Wikipedia pages for the four major fashion weeks are presented below. They each have their own characteristics, and because these trend charts are exploratory, there are really no major conclusions to be gleaned from them.
Ultimately, there’s no doubt that high fashion is more popular today than ever before. This is evidenced by sales patterns, amount of media exposure, and the explosion in fashion blogging. This post sought to identify whether people were using Wikipedia to inform themselves about the major fashion weeks and how that trend has changed over time. While those patterns have seen slight increases or remained stagnant, that does not minimize the emergence of high fashion into American popular culture.
R is an incredibly active software project. Since the first code was checked into Subversion back on 18 September 1997, there have been more than 100,000 updates to the R sources by the 20-some members of the R Core Team. As you can see from the activity graphs from the GitHub mirror of the R sources, the pace of development has been frequent and steady over the lifetime of the project:
It's hard to get a sense of the pace of activity from that chart alone, but Matt Dowle (he of the data.table package) published this mesmerizing visualization of R development to YouTube:
Matt created this visualization using Gource, using the aforementioned R GitHub mirror as a data source. It's cool to see all of the R developers moving around the source code tree as it grows over time. If you'd like to drill down into the individual contributors, Romain Francois has some R code and charts (created in 2009) that do exactly that.
I’m announcing the alpha launch of EcoPy: Ecological Data Analysis in Python. EcoPy is a Python module that contains a number of techniques (PCA, CA, CCorA, nMDS, MDS, RDA, etc.) for exploring complex multivariate data. For those of you familiar with R, think of this as the Python equivalent to the ‘vegan‘ package.
However, I’m not done! This is the alpha launch, which means you should exercise caution before using this package for research. I’ve stress-tested a couple of simple examples to ensure I get equivalent output in numerous programs, but I haven’t tested it out with real, messy data yet. There might be broken bits or quirks I don’t know about. For the moment, be sure to verify your results with other software.
That said, I need help! My coding might be sloppy, or you might find errors that I didn’t, or you might suggest improvements to the interface. If you want to contribute, either by helping with coding or stress-testing on real-world examples, let me know. The module is available on github and full documentation is available at readthedocs.org.