Installing `ggplot2` and `ggmap` from CRAN will result in an error. To avoid the error, install both packages from GitHub with the package `devtools`, and restart R if the problem persists.

```r
devtools::install_github("dkahle/ggmap")
devtools::install_github("hadley/ggplot2")
```

Metro systems are an interesting way to learn more about the growth of a city over time. You can see things like how the city expanded as public transit spread farther and farther from the original city limits. You can also see how the city center moved from certain neighborhoods to others. One example of this is the city of Paris, where I currently live, which started off just having metro stops along the river, and then quickly spread to a more circular shape over time. The gif below shows that progression over time. Blue dots are metro stops and the red dot is the center of the metro system.

By the end of these three posts you will be able to make that gif yourself, as well as gifs for three other European cities. To do this we’ll be playing around with several R packages, with the final goal of making gifs with Delaunay triangulations. Several of the packages we’ll be using are from the `tidyverse`. However, instead of loading them all at once, we’ll load each separately so you can get a better idea of what each package can be used for. In the future, though, I highly recommend the single `library(tidyverse)` call to make your life easier.

The tutorial is cut into three posts: 1) making maps with metro stops, 2) making maps with Delaunay triangulations and centroids, and 3) making maps that change over time, where we’ll make the gif above.

Today’s data is the location of metro stops in four European cities: Paris, Berlin, Barcelona, and Prague. To collect the names of stops for each city I went to the Wikipedia article for that city’s metro system. I also coded whether each stop was actually in the city being analyzed or in a different (usually bordering) town.

With my data in place, I began organizing it in R. I used three packages to start off: `dplyr`, `tidyr` (both in the `tidyverse`), and `ggmap`. With `ggmap` you can download maps from various sources, including Google Maps, and plot them in the `ggplot2` environment. I first read in my data and then create a new column called `geo_location` by combining the `station` and `location` columns with a `unite()` call. I also use the `separate()` call, the converse of `unite()`, to split the `opened` column (the date when the stop was opened) into three columns: one each for month, day, and year. Now I get to use my first `ggmap` call, `mutate_geocode()`. I can feed it the `geo_location` column from my data frame and it will make two new columns, `lon` and `lat`, finding the longitude and latitude of each stop and adding these values to the new columns. Note: I originally tried adding the word “Station” at the end of every stop name, but this caused problems.

```r
library(dplyr)
library(tidyr)
library(ggmap)

data = read.table("https://raw.githubusercontent.com/pagepiccinini/blog/master/2016-09-27_metros/data_metros.txt", header = T, sep = "\t") %>%
  unite(geo_location, c(station, location), sep = ", ", remove = FALSE) %>%
  separate(opened, into = c("opened_month", "opened_day", "opened_year"), sep = "/") %>%
  mutate_geocode(geo_location, source = "google")
```

The output from Google Maps is not exactly the same as the Google Maps API. I tried to hand-correct errors as much as possible, but I am not an expert on European metro systems. If you see an erroneous data point from your city, feel free to let me know! The final data below is thus a combination of data from the `mutate_geocode()` call and any hand correction on my part. Below you can see some of the data we’ve created. I’ve only included the first six data points for the sake of space, but you can look at all of the data in the GitHub repository.

```r
data <- read.table("https://raw.githubusercontent.com/pagepiccinini/blog/master/2016-09-27_metros/data_metro_full.txt", header = T, sep = "\t")

head(data)
```

```
   city                             geo_location                 location         station line
1 Paris                  Abbesses, Paris, France            Paris, France        Abbesses   12
2 Paris                    Alésia, Paris, France            Paris, France          Alésia    4
3 Paris           Alexandre Dumas, Paris, France            Paris, France Alexandre Dumas    2
4 Paris            Alma – Marceau, Paris, France            Paris, France  Alma – Marceau    9
5 Paris Anatole France, Levallois-Perret, France Levallois-Perret, France  Anatole France    3
6 Paris                    Anvers, Paris, France            Paris, France          Anvers    2
  opened_month opened_day opened_year      lon      lat
1           10         31        1912 2.338559 48.88430
2           10         30        1909 2.327058 48.82820
3            1         31        1903 2.394419 48.85633
4            5         27        1923 2.352222 48.85661
5            9         24        1937 2.284904 48.89223
6           10          7        1902 2.344253 48.88285
```

With our data in place, we can start making our maps. This brings us to our second `ggmap` call, `get_googlemap()`. With this call I can download city-specific maps for my four cities by setting `center` to each of my cities. I can also set the type of map (terrain, satellite, roadmap, hybrid), how close to zoom in (integers that range from continent level to building level), the size of my map in pixels, and whether I want the map in black and white or color.

```r
paris_map = get_googlemap(center = "Paris", maptype = "roadmap", zoom = 11, size = c(640, 420), color = "bw")
berlin_map = get_googlemap(center = "Berlin", maptype = "roadmap", zoom = 10, size = c(640, 420), color = "bw")
barcelona_map = get_googlemap(center = "Barcelona", maptype = "roadmap", zoom = 11, size = c(640, 420), color = "bw")
prague_map = get_googlemap(center = "Prague", maptype = "roadmap", zoom = 11, size = c(640, 420), color = "bw")
```

With our map objects saved from Google, we can now plot our maps with our metro stops on top. Since I’ll be making roughly the same plot each time, I wrote a function, which you can see below. The main difference from a typical `ggplot2` plot is that instead of using `ggplot()` to start off the plot, you use `ggmap()` and feed it the map we saved. The setting `extent = "device"` is used to suppress the x and y axes with their tick marks. From then on, it takes the same `ggplot2` calls as any other plot. For example, we can use `geom_point()` to plot our metro stops. See the maps with metro stops for the four cities below. I’ve included the code for the Paris map as an example, but hidden the rest since it is basically the same.

```r
city_plot = function(city_name, city_map){
  ggmap(city_map, extent = "device") +
    geom_point(data = subset(data, city == city_name),
               aes(x = lon, y = lat), color = "#0571b0", size = 3)
}

paris.plot = city_plot("Paris", paris_map)
paris.plot
```

In this post we pulled down geolocation information from Google for metro stops in four cities. We then plotted those stops on top of maps of the cities. In the next post we’ll investigate the relative sizes of these metro networks and where the “center” of the city is according to its metro system.

Averaging the adjusted counts for February, April, June, August, October, and December – the months that are 2/3 positive and 1/3 negative – gives 616.7 divorces with an estimated standard error of 15.8. Averaging the adjusted counts for January, March, May, July, September, and November – the months that are 2/3 negative and 1/3 positive – gives 661.5 divorces with an estimated standard error of 15.8. The difference between the two groupings is -44.9 with an estimated standard error of 25.8. So positive months have fewer divorces than negative months in this data set – not what I would expect, since the positive signs are associated with action and the negative signs with reaction.

Averaging the adjusted counts for January, April, July, and October – the months that are 2/3 cardinal and 1/3 fixed – gives 627.3 divorces with an estimated standard error of 22.3. Averaging the counts for February, May, August, and November – the months that are 2/3 fixed and 1/3 mutable – gives 611.0 divorces with an estimated standard error of 21.8. Averaging the counts for March, June, September, and December – the months that are 2/3 mutable and 1/3 cardinal – gives 681.2 divorces with an estimated standard error of 22.8.

Looking at the three differences: between the first and third groups – mainly cardinal versus mainly mutable – the difference is -53.9 with an estimated standard error of 35.8; between the first and second groups – mainly cardinal versus mainly fixed – the difference is 16.3 with an estimated standard error of 34.4; and between the second and third groups – mainly fixed versus mainly mutable – the difference is -70.2 with an estimated standard error of 35.3. Most of the differences between the groups are not easily explained by random variation, and the results are in agreement with what one would expect from astrology: cardinal and mutable signs are associated with change and adjustment, while fixed signs are associated with steadfastness.

The estimated covariance matrix for the 40 observations is found for the adjusted counts. We use the model `mod.arima.r.1` from part 2.

First, the matrix `cov.r.1` is created as a matrix of zeros. One plus the sum of the squares of the estimated coefficients of the inclusions from the model is placed on the diagonal.

```r
cov.r.1 = matrix(0, 40, 40)
diag(cov.r.1) = sum(c(1, mod.arima.r.1$coef[1:13]^2))
```

Next, the off-diagonal terms of the covariance matrix are put in the matrix, starting with the first row and column.

```r
for (i in 1:12) {
  cov.r.1[cbind(c(1, (i+1)), c(i+1, 1))] = sum(c(1, mod.arima.r.1$coef[1:(13-i)]) *
                                                 mod.arima.r.1$coef[i:13])
}
cov.r.1[cbind(c(1, 14), c(14, 1))] = mod.arima.r.1$coef[13]
for (i in 2:14) {
  cov.r.1[(i-1) + 2:14, i] = cov.r.1[2:14, 1]
  cov.r.1[i, (i-1) + 2:14] = cov.r.1[2:14, 1]
}
```
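For reference, the entries being filled in are the standard autocovariances of an MA(13) process. With $\theta_{0} = 1$,

$$\gamma(k) = \sigma^{2} \sum_{j=0}^{13-k} \theta_{j}\,\theta_{j+k}, \qquad k = 0, 1, \ldots, 13,$$

and $\gamma(k) = 0$ for $k > 13$. The diagonal is the $k = 0$ case, the off-diagonals are the $k \geq 1$ cases, and the $\sigma^{2}$ factor is applied in the next step.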

Last, `cov.r.1` is multiplied by the estimated variance of the inclusions from the `arima()` model.

```r
cov.r.1 = cov.r.1 * mod.arima.r.1$sigma2
```

The average counts for five sets of months, along with their standard errors, are found. The matrix `t.m` is the matrix of multipliers for finding the means and standard errors. The five rows of `t.m` are for positive signs, negative signs, cardinal signs, fixed signs, and mutable signs.

First the matrix `t.m` is created with five rows and filled with zeros. Row 1 selects the mainly positive months, row 2 the mainly negative months, row 3 the mainly cardinal months, row 4 the mainly fixed months, and row 5 the mainly mutable months. Last, the means and standard errors are found and displayed.

```r
t.m = matrix(0, 5, 40)
t.m[1,] = rep(1:0, 20)
t.m[2,] = rep(0:1, 20)
t.m[3,] = c(rep(c(0,0,1), 13), 0)
t.m[4,] = c(rep(c(1,0,0), 13), 1)
t.m[5,] = c(rep(c(0,1,0), 13), 0)

m.m = t.m %*% div.a.ts / apply(t.m, 1, sum)
s.m = sqrt(diag(t.m %*% cov.r.1 %*% t(t.m))) / apply(t.m, 1, sum)
m.m
s.m
```

The four tests are done using a structure similar to that used for the means. The four tests are positive vs negative, cardinal vs fixed, cardinal vs mutable, and fixed vs mutable.

```r
t.t = matrix(0, 4, 40)
t.t[1,] = rep(c(1/20, -1/20), 20)
t.t[2,] = c(rep(c(0, -1/13, 1/13), 13), 0)
t.t[3,] = c(rep(c(-1/14, 0, 1/13), 13), -1/14)
t.t[4,] = c(rep(c(1/14, -1/13, 0), 13), 1/14)

m.t = t.t %*% div.a.ts
s.t = sqrt(diag(t.t %*% cov.r.1 %*% t(t.t)))
m.t
s.t
```

Finding p-values for the four test statistics shows that all but the difference between mainly cardinal and mainly fixed are significant at a level of 15% for two-sided tests. The four test statistics are -1.74 for positive versus negative, with a p-value of 0.0452; -1.51 for cardinal versus mutable, with a p-value of 0.0704; 0.47 for cardinal versus fixed, with a p-value of 0.6806; and -1.99 for fixed versus mutable, with a p-value of 0.0274.

The code for the four test statistics and the code for the p-values are:

```r
m.t / s.t
pnorm(m.t / s.t)
```
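Note that `pnorm()` returns a lower-tail (one-sided) probability, which is what the p-values above are. For the two-sided comparison against the 15% level, a minimal sketch (assuming the test statistics are `m.t / s.t` as computed above) doubles the tail area:

```r
# Two-sided p-values: double the probability in the more extreme tail.
z = m.t / s.t
2 * pnorm(-abs(z))
```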

The negative direction at the lag of 13 could be explained by the cycles of the Moon. There are approximately 13 Moon cycles per year, so for a given month, the Moon will start in a positive sign one year and in a negative sign the next, over reasonably short time periods. The estimated standard error for the means over the 13 lags is 37.0, except for the mean starting with the first observation, which has an estimated standard error of 30.0. There are 13 means from the 40 observations.

The means and standard errors for the observations lagged by 13 are found, using a similar method as for the means and tests.

```r
vec.14 = c(1, 1 + seq(13, 40, 13))
vec.13 = c(2, 2 + seq(13, 26, 13))

t.13 = matrix(0, 13, 40)
t.13[1, vec.14] = 1/4
for (i in 2:13) t.13[i, vec.13 + (i-2)] = 1/3

m.13 = t.13 %*% div.a.ts
s.13 = sqrt(diag(t.13 %*% cov.r.1 %*% t(t.13)))
m.13
s.13
```

In this study, the sample size is small. A value of 0.15 for alpha is reasonable for such a small sample. According to time series theory, the distributions of the means and test statistics should be asymptotically normal and the standard errors should converge to their parametric value, so confidence intervals based on normality and z-tests for the test statistics may be appropriate, even though the sample is small. The diagnostic plots from part 2 indicate that the errors from the model follow a normal distribution quite closely, which supports using the assumption of normality.

Of interest is the correspondence between divorce count means and astrological theory for cardinal, fixed, and mutable signs. In the sample, people were less likely to divorce in months that were mainly fixed than in months that were mainly cardinal or mutable, with the difference being largest for months that were 2/3 fixed and 1/3 mutable as compared to months that were 1/3 cardinal and 2/3 mutable.

Overall, the model in which divorce counts increase as time passes after a shift in the astronomical point I call Vulcan fits the data best. The result is encouraging and deserves further study.


```r
mod.lm = lm(div.a.ts ~ vulcan.ts)
summary(mod.lm)
```

```
Call:
lm(formula = div.a.ts ~ vulcan.ts)

Residuals:
    Min      1Q  Median      3Q     Max
-159.30  -53.88  -10.37   53.48  194.05

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 621.23955   20.24209  30.690   <2e-16 ***
vulcan.ts     0.06049    0.05440   1.112    0.273
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 77.91 on 38 degrees of freedom
Multiple R-squared:  0.03151,	Adjusted R-squared:  0.006026
F-statistic: 1.236 on 1 and 38 DF,  p-value: 0.2731
```

Clearly, there is little reason to think that the jump points are related to the adjusted divorce counts given the simple regression model. However, since the data is time series data, there is a possibility that the errors are autocorrelated.

Following the Box-Jenkins approach to fitting time series, I decided to start by looking at the residuals from the adjusted divorce count model as a stationary time series. I generated two plots: a plot of the correlations between present and lagged variables using the R function `acf()`, and a plot of the partial correlations between present and lagged variables using the R function `pacf()`. In both the autocorrelation and partial autocorrelation plots, the blue lines represent the 95% level.

```r
acf(mod.lm$residuals, main="autocorrelation of the residuals fitting divorce counts on Ceres, Juno, Vesta, and Pallas breaks")
pacf(mod.lm$residuals, main="partial autocorrelation of the residuals fitting divorce counts on Ceres, Juno, Vesta, and Pallas breaks")
```

Here are the plots: first the autocorrelations and then the partial autocorrelations.

Following the Box-Jenkins approach, the correlation plot will indicate if the time series is a moving average. If there are spikes in the correlation plot, the spikes indicate the orders of the moving average. An autoregressive time series will have an exponentially decaying correlation plot.

The partial correlation plot indicates the order of the process if the process is autoregressive. The order of the auto regressive process is given by the value of the lag just before the partial correlation goes to essentially zero.

A process with both moving average and autoregressive terms is hard to identify using correlation and partial correlation plots; such a process is indicated by decay that is not exponential and that starts only after a few lags (see the Wikipedia page on Box-Jenkins).
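As a quick illustration of these signatures (a sketch of my own, not part of the original analysis), one can simulate an MA(1) and an AR(1) series with `arima.sim()` and compare their correlation plots:

```r
set.seed(1)
# MA(1) series: its acf should cut off after lag 1,
# while its pacf decays.
ma1 = arima.sim(model = list(ma = 0.8), n = 500)
# AR(1) series: its acf decays exponentially,
# while its pacf cuts off after lag 1.
ar1 = arima.sim(model = list(ar = 0.8), n = 500)

par(mfrow = c(2, 2))
acf(ma1,  main = "MA(1): acf cuts off")
pacf(ma1, main = "MA(1): pacf decays")
acf(ar1,  main = "AR(1): acf decays")
pacf(ar1, main = "AR(1): pacf cuts off")
```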

Looking at the above plots, the autocorrelation plot shows negative spikes at lag 1 and lag 13 and is large in absolute value at lags 3, 12, and 14, but otherwise is consistent with random noise. The partial correlation plot has negative spikes at lags 1, 2, and 13 and is large in absolute value at lag 15, indicating that the process may not be stationary. Neither plot decays.

Using periodograms is another way to evaluate autocorrelation. A periodogram can find periodicities in a time series. Below are plots of the periodogram and the normalized cumulative periodogram, created using `spec.pgram()` and `cpgram()` in R. In the periodogram, there are quite large spikes at about the frequencies .35 and .40. These frequencies correspond to lags of 2.9 and 2.5, somewhat close to the lags of 2 and 3 months found in the correlation plot. The normalized cumulative periodogram shows a departure from random noise at a frequency of around .29, or a lag of around 3.4. The blue lines in the normalized cumulative periodogram plot are the 95% confidence limits for random noise.

Plot the periodogram and cumulative periodogram of the residuals from the model.

```r
spec.pgram(mod.lm$residuals, main="Residuals from Simple Regression Raw Periodogram")
cpgram(mod.lm$residuals, main="Residuals from Simple Regression Cumulative Periodogram")
```

First the periodogram and next the cumulative periodogram.

Note that at the frequency of twenty, corresponding to a lag of two, the periodogram is also high.

I used the function `arima()` in R to fit some autoregressive moving average models with orders up to thirteen, both with and without the astronomical regressor. The best model, using the Akaike information criterion (see the Wikipedia page Akaike information criterion), was

$$Y_{t}=\beta_{0} + \beta_{1}X_{t} + \theta_{1}\epsilon_{t-1} + \theta_{3} \epsilon_{t-3} + \theta_{13} \epsilon_{t-13} + \epsilon_{t}$$

where Y_{t} is the number of divorces in month t, X_{t} is the number of days since the last shift in the mean of the celestial longitudes of Ceres, Pallas, Vesta, and Juno over the shortest arc, and ε_{t} is the difference between the expected number of divorces and the actual number of divorces for month t.

The estimated coefficients, with the respective standard errors, are given in the table below.

| Coefficient | Estimate | Standard Error |
| --- | --- | --- |
| βhat_{0} | 624 | 10.5 |
| βhat_{1} | 0.054 | 0.031 |
| θhat_{1} | -0.682 | 0.274 |
| θhat_{3} | 0.386 | 0.180 |
| θhat_{13} | -0.598 | 0.227 |

The final model was fit using `arima()`. Below, two models are fit: the final one (with the regressor) and one without the regressor.

In `arima()`, in the argument `fixed`, an `NA` indicates the parameter is fit and a `0` indicates no parameter is fit. In `fixed` there is a value for every parameter implied by the arguments `order` and `xreg`, plus the intercept. So in the first model below, there are 13 moving average terms, three of which are fit, plus an intercept and one regressor term, for 15 values in `fixed`.
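To make the mapping concrete, the `fixed` vector for the model with the regressor can be written out element by element (a small sketch restating the vector used below):

```r
# Positions 1-13 are ma1..ma13, position 14 is the intercept,
# position 15 is the regressor (vulcan.ts).
# NA = parameter is fit, 0 = parameter is held at zero.
fixed.1 = c(NA, 0, NA, rep(0, 9), NA, NA, NA)
length(fixed.1)  # 15: only ma1, ma3, ma13, the intercept, and the regressor are fit
```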

The model with the regressor.

```r
mod.arima.r.1 = arima(div.a.ts, order=c(0,0,13), fixed=c(NA,0,NA,rep(0,9), NA,NA,NA), xreg=vulcan.ts)
mod.arima.r.1
```

```
Call:
arima(x = div.a.ts, order = c(0, 0, 13), xreg = vulcan.ts, fixed = c(NA, 0, NA, rep(0, 9), NA, NA, NA))

Coefficients:
          ma1  ma2     ma3  ma4  ma5  ma6  ma7  ma8  ma9  ma10  ma11  ma12     ma13  intercept
      -0.6822    0  0.3855    0    0    0    0    0    0     0     0     0  -0.5975   623.7144
s.e.   0.2744    0  0.1799    0    0    0    0    0    0     0     0     0   0.2270    10.4553
      vulcan.ts
         0.0541
s.e.     0.0309

sigma^2 estimated as 2618:  log likelihood = -220.54,  aic = 453.07
```

The model without the regressor

```r
mod.arima.r.2 = arima(div.a.ts, order=c(0,0,13), fixed=c(NA,0,NA,rep(0,9), NA,NA))
mod.arima.r.2
```

```
Call:
arima(x = div.a.ts, order = c(0, 0, 13), fixed = c(NA, 0, NA, rep(0, 9), NA, NA))

Coefficients:
          ma1  ma2     ma3  ma4  ma5  ma6  ma7  ma8  ma9  ma10  ma11  ma12     ma13  intercept
      -0.6044    0  0.4800    0    0    0    0    0    0     0     0     0  -0.6596   640.3849
s.e.   0.2025    0  0.1775    0    0    0    0    0    0     0     0     0   0.2266     4.7133

sigma^2 estimated as 2655:  log likelihood = -221.92,  aic = 453.85
```

The coefficient for the slope of the regression is not at least 1.96 times the size of its standard error; however, the Akaike information criterion is the smallest of the models at which I looked. The model without the regressor is a close contender, with the Akaike criteria for the two models being 453.07 and 453.85.

The residuals from the two models are plotted together using `ts.plot()`.

```r
ts.plot(cbind(mod.arima.r.1$residuals, mod.arima.r.2$residuals), col=4:3,
        xlab="Time", ylab="Residuals", main="Residuals for Two Moving Average Models")
legend("topright", c("With Regression Coefficient", "Without Regression Coefficient"),
       fill=4:3, cex=.7)
```

Below is a plot of the residuals for the two competing models. To me the residuals look more even for the model given above.

The estimates indicate that the intercept is about 624 divorces; that the number of divorces increases by about 5.4 for every one hundred day increase in the time after the midpoint over the shortest arc of the longitudes of Juno, Ceres, Vesta, and Pallas shifts by 90 or 180 degrees; that errors tend to reverse in direction over one lag and thirteen lags, but tend to be in the same direction over three lags.

Below are some diagnostic plots; the plot of the residuals versus the fitted values, the Q-Q plot of the residuals versus normal variates, and a time series plot of the actual values and the fitted values.

First the graphic screen is set up to plot three plots in a row. Next the residuals are plotted against the fitted values, then against standard normal percentiles. Last the data is plotted with the fitted values using `ts.plot()`.

```r
par(mfrow=c(1,3))
plot(as.numeric(div.a.ts - mod.arima.r.1$residuals), as.numeric(mod.arima.r.1$residuals),
     main="Plot of Residuals versus Fitted Values for the Adjusted Divorce Data",
     xlab="Fitted Value", ylab="Residual", font.main=1)
qqnorm(mod.arima.r.1$residuals,
       main="QQ Normal Plot for Residuals from the Divorce Model", font.main=1)
ts.plot(div.a.ts, div.a.ts - mod.arima.r.1$residuals, ylab="Counts", xlab="Year",
        main="Adjusted Divorce Counts and Fitted Counts", col=4:3)
legend("bottomleft", c("Adjusted Divorce Counts", "Fitted Adjusted Divorce Counts"),
       col=4:3, lwd=2, cex=.7)
cor.test(div.a.ts - mod.arima.r.1$residuals, mod.arima.r.1$residuals)
```

The residual versus fitted value plot looks like there is some correlation between the residuals and the fitted values. Testing the correlation gives a p-value of 0.0112, so the model is not ideal. While the model does not pick up most of the extreme data points, the residual and QQ normal plots do not show any great deviations from a model of normal errors with constant variance. The fitted values were found by subtracting the residuals from the actual data points.

The correlation between the fitted values and the residuals is found and tested for equality with zero using `cor.test()`.

```r
cor.test(div.a.ts - mod.arima.r.1$residuals, mod.arima.r.1$residuals)
```

```
	Pearson's product-moment correlation

data:  div.a.ts - mod.arima.r.1$residuals and mod.arima.r.1$residuals
t = 2.6651, df = 38, p-value = 0.01123
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.09736844 0.63041834
sample estimates:
      cor
0.3968411
```

Here is the function I created to compare models. The function is set up to only do moving averages. It takes a while to run with `or` equal to 13, since 2 to the power of 13, or 8,192, calls to `arima()` are made – each element of `fixed` can be either `0` or `NA`. If both autoregressive and moving average models were run, then for `or` equal to 13 the function would have to fit 4 to the power of 13, or 67,108,864, models to get all combinations, which is beyond the capacity of R, at least on my machine.

```r
arima.comp = function(ts=div.a.ts, phi=vulcan.ts, or=3) {
  phi <- as.matrix(phi)
  if (nrow(phi)==1) phi <- t(phi)
  fx.lst <- list(0:1)
  #for (i in 2:(2*or)) fx.lst <- c(fx.lst,list(0:1))
  for (i in 2:or) fx.lst <- c(fx.lst,list(0:1))
  fx.mat <- as.matrix(expand.grid(fx.lst))
  #fx.phi1 <- matrix(NA, 2^(2*or), 1+ncol(phi))
  fx.phi1 <- matrix(NA, 2^or, 1+ncol(phi))
  fx.mat[fx.mat==1] <- NA
  fx.mat1 <- cbind(fx.mat, fx.phi1)
  fx.mat2 <- cbind(fx.mat, fx.phi1[,1])
  #res1 <- numeric(2^(2*or))
  #res2 <- numeric(2^(2*or))
  res1 <- numeric(2^or)
  res2 <- numeric(2^or)
  #for (i in 1:2^(2*or)) {
  for (i in 1:2^or) {
    #res1[i] <- arima(ts, order=c(or,0,or), xreg=phi, fixed=fx.mat1[i,])$aic
    #res2[i] <- arima(ts, order=c(or,0,or), fixed=fx.mat2[i,])$aic
    res1[i] <- arima(ts, order=c(0,0,or), xreg=phi, fixed=fx.mat1[i,])$aic
    res2[i] <- arima(ts, order=c(0,0,or), fixed=fx.mat2[i,])$aic
  }
  st1 <- order(res1)
  st1 <- st1[1:min(length(st1),10)]
  print(res1[st1])
  print(fx.mat1[st1,])
  st2 <- order(res2)
  st2 <- st2[1:min(length(st2),10)]
  print(res2[st2])
  print(fx.mat2[st2,])
}
```

In the function, `ts` is the time series to be modeled, `phi` is the vector (or matrix) of regressor(s), and `or` is the highest moving average order to be included. To run just autoregressive models, in the calls to `arima()`, move `or` from the third element of `order` to the first element. To run both autoregressive and moving average models, uncomment the commented lines and comment out the lines they replace.

In the function, first `phi` is made into a matrix if it is a vector. Next the matrices `fx.mat1` and `fx.mat2` are created, with the vectors to be assigned to `fixed` in the rows; `fx.mat1` includes a regressor and `fx.mat2` does not. Next `res1` and `res2` are created to hold the values of the Akaike criterion for each model: `res1` for models with a regressor and `res2` for models without. Next the arima models are fit in a `for()` loop. Then the numeric order of the elements in `res1` and `res2` is found using `order()`, and only the ten models with the smallest Akaike criteria are kept for each of `res1` and `res2`. The variables `st1` and `st2` contain the indices of those models. Last, the ten smallest Akaike criteria and the corresponding values of `fixed` are printed, for models both with and without the regressor.

In this part, we have fit a moving average model with a regressor variable to monthly divorce count data from the state of Iowa. The sample size is just 40 observations.

After taking into consideration the autocorrelations between the observations, we see that divorce counts increase as time passes after a shift in the astronomical point that I call Vulcan – the mean over the shortest arc of the asteroids Juno, Pallas, and Vesta, and the dwarf planet Ceres. The sample size is quite small, so we would not expect a strong (highly significant) effect, but the results indicate that there is some support for the hypothesis that the astronomical point Vulcan is related to divorce counts. Further testing is warranted.

In the next part, we use the covariance structure of the model selected in this part to estimate standard errors for differing linear combinations of the monthly counts. The results are interpreted from an astrological point of view.


For some years now, I have been interested in the possible astrological influence of a composite astronomical point, the mean over the shortest arc of the celestial longitudes of the celestial bodies Juno, Ceres, Vesta, and Pallas. I wanted to see if the point affected divorce rates. In astrology, planets rule zodiacal signs. In older systems of astrology, from before the discoveries of the outer planets, each of the five known planets (Mercury, Venus, Mars, Jupiter, and Saturn – we exclude the earth) ruled two signs. As Uranus, Neptune, and Pluto were discovered, the planets were given rulership over one of Saturn’s, Jupiter’s, and Mars’s signs, respectively. In 1977, the celestial body Chiron was discovered. Some astrologers give Chiron the rulership of one of Mercury’s signs, which would leave Venus as the only planet with dual rulership.

The celestial bodies Juno, Ceres, Vesta, and Pallas were discovered in the early 1800’s. Referencing Roman mythology, some astrologers feel that the discoveries of Juno – the wife, Ceres – the mother, Vesta – the sister, and Pallas – the daughter synchronized with the emergence of the struggle for the full participation of women within society. Venus (which, from the above, is for some astrologers the only planet that still rules two signs) is one of the traditional rulers of women in astrology. The two signs that Venus rules are those associated with money and marriage. I had thought that the mean of the celestial longitudes over the shortest arc connecting Juno, Ceres, Vesta, and Pallas might be the ruler of the sign of marriage, since Juno, Ceres, Vesta, and Pallas refer to female archetypes. Then Venus would just rule the sign of money. The mean over the shortest arc jumps 90 degrees or 180 degrees from time to time. To test the hypothesis, I had thought that the jump point of the mean over the shortest arc might be related to divorce. An ephemeris for the average – which I call Vulcan – as well as a point I call Cent, from 1900 to 2050, can be found here.

I created a data vector of monthly divorce counts in Iowa from August of 2006 to November of 2009, and a vector based on the jump points of the mean over the shortest arc of the celestial longitudes of Juno, Vesta, Ceres, and Pallas. The divorce counts were adjusted for the number of business days in a month. The adjusted divorce counts are the dependent variable. The independent variables are the usual column of ones to fit a mean, and a column which increases linearly as time passes after a jump point and is reset to zero at each jump point.

First, enter the monthly divorce counts for Iowa from 8/06 to 11/09.

```r
div.ts = ts(c(619, 704, 660, 533, 683, 778, 541, 587, 720, 609, 622, 602,
              639, 599, 657, 687, 646, 608, 571, 768, 703, 638, 743, 693,
              624, 694, 660, 522, 639, 613, 446, 863, 524, 564, 716, 635,
              649, 565, 567, 626), start=c(2006,8), freq=12)
```

Next enter the break point information: the number of days from the last shift

vulcan.ts = ts(c(63, 94, 124, 155, 185, 216, 247, 275, 306, 336, 367, 397, 428, 459, 489, 520, 550, 581, 612, 641, 672, 702, 733, 763, 19, 50, 80, 4, 34, 65, 96, 124, 155, 185, 216, 246, 277, 308, 3, 34), start=c(2006,8), freq=12)

The divorce data is from the Centers for Disease Control and Prevention website, from the National Vital Statistics Report page. I took marriage and divorce data for Iowa from the reports from August of 2006 to November of 2009. The divorce count for April of 2008 was 161, which was unusually small. In my analyses, I replaced the number with 703, the average of the numbers on either side.

The average of the longitudes of the four major asteroids, Juno, Ceres, Pallas, and Vesta (yes, I know Ceres is now a dwarf planet) along the shortest arc will jump by 90 or 180 degrees from time to time. The variable I use is the time in days from the last day in which a jump occurred. Since the marriage and divorce data is monthly, each data point is assumed to be at the 15th of the month.

The question of interest is, are the jump points associated with the divorce counts in a linear way? Since the months are not of equal length, I have adjusted the divorce counts by dividing the divorce count for a month by the number of state business days in the month and multiplying the result by the average number of state business days in a month over the observed months.

First create a vector of days of the week, with 1 for Sunday, 2 for Monday, etc. 8/06 started on a Tuesday.

wks = c(3:7, rep(1:7, 173), 1:2)

Next create a daily vector for the months, with 1 for the first month, 2 for the second month, etc.

mns = c(8:12, rep(1:12, 2), 1:11)
mns.d = 1:40
mn = c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
mns.dys = rep(mns.d, mn[mns])

Add 2008 leap day

mns.dys = c(mns.dys[1:553], 19, mns.dys[554:1217])

Next remove Sundays and Saturdays

mns.dys.r = mns.dys[wks>1 & wks<7]

Next use `table()` to count the weekdays in each month

mns.dys.s = c(table(mns.dys.r))

Next enter the number of government holidays in each of the 40 months

hol = c(0, 1, 0, 3, 1, 2, 0, 0, 0, 1, 0, 1, 0, 1, 0, 3, 1, 2, 0, 0, 0, 1, 0, 1, 0, 1, 0, 3, 1, 2, 0, 0, 0, 1, 0, 1, 0, 1, 0, 3)

Next subtract out the holidays

mns.dys.s = mns.dys.s - hol

Next adjust the divorce counts

div.a.ts = div.ts * mean(mns.dys.s)/mns.dys.s

The holidays I excluded are New Year’s Day, Martin Luther King Day, Memorial Day, Independence Day, Labor Day, Veteran’s Day, the two days for Thanksgiving, and Christmas Day, as these days were state holidays in Iowa in 2012.

The number of state business days that I found for each month are given in the table below.

|      | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 2006 |     |     |     |     |     |     |     | 23  | 20  | 22  | 19  | 20  |
| 2007 | 21  | 20  | 22  | 21  | 22  | 21  | 21  | 23  | 19  | 23  | 19  | 20  |
| 2008 | 21  | 21  | 21  | 22  | 21  | 21  | 22  | 21  | 21  | 23  | 17  | 22  |
| 2009 | 20  | 20  | 22  | 22  | 20  | 22  | 22  | 21  | 21  | 22  | 18  |     |

There are 1218 days in the 40 months. The divorce counts are adjusted to reflect the number of days that the county offices are open.
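The day count can be double-checked with base R's date arithmetic:

```r
# The 40 months from August 2006 through November 2009 span 1218 days
as.integer(as.Date("2009-12-01") - as.Date("2006-08-01"))  # 1218
```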

Since the dependent variable is a time series, one could expect autocorrelation in the error terms which may need to be taken into account. Below, I have plotted the adjusted divorce counts and the jump point variable.

Next plot the two data vectors:

data.d.v = cbind("jump point variable" = vulcan.ts, "adjusted divorce count" = div.a.ts)
plot(data.d.v, main = "Adjusted Divorce Counts in Iowa with the Jump Point Variable, August 2006 to November 2009")

A relationship between the two plots is not immediately evident. Therefore, in the second part we will evaluate the associations by fitting a regression model.

Related Post

- Regression model with auto correlated errors – Part 3, some astrology
- Regression model with auto correlated errors – Part 2, the models
- Linear Regression from Scratch in R
- R for Publication by Page Piccinini: Lesson 5 – Analysis of Variance (ANOVA)
- R for Publication by Page Piccinini: Lesson 4 – Multiple Regression


Most R users will be familiar with the `lm()` function, which allows us to perform linear regression quickly and easily. But one drawback to the `lm()` function is that it takes care of the computations to obtain parameter estimates (and many diagnostic statistics, as well) on its own, leaving the user out of the equation. So, how can we obtain these parameter estimates from scratch?

In this post, I will outline the process from first principles in R. I will use only matrices, vectors, and matrix operations to obtain parameter estimates using the closed-form linear algebraic solution. After reading this post, you’ll see that it’s actually quite simple, and you’ll be able to replicate the process with your own data sets (though using the `lm()` function is of course much more efficient and robust).

For this example, I’ll use the well-known “Boston” data set from the MASS library. For those who aren’t familiar with it, the Boston data set contains 14 economic, geographic, and demographic variables for 506 census tracts in the city of Boston, collected in the 1970s. Our target variable will be the median home value for each tract — the column titled ‘medv.’ Let’s get an idea of what this data set looks like:

library(MASS)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
##  $ crim   : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int 0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num 6.58 6.42 7.18 7 7.15 ...
##  $ age    : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num 4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int 1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num 296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num 397 397 393 395 397 ...
##  $ lstat  : num 4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

To learn more about the definition of each variable, type `help(Boston)` into your R console.

Now we’re ready to start. Linear regression typically takes the form

$$y = X\beta + \epsilon$$

where ‘y’ is a vector of the response variable, ‘X’ is the matrix of our feature variables (sometimes called the ‘design’ matrix), and \(\beta\) is a vector of parameters that we want to estimate. \(\epsilon\) is the error term; it represents features that affect the response but are not explicitly included in our model.

The first thing we will need is a vector of our response variable, typically called ‘y’. In this case, our response variable is the median home value in thousands of US dollars: ‘medv’:

y <- Boston$medv

Next, we’ll need our ‘X’ or ‘design’ matrix of the input features. We can do this with the `as.matrix()` function, grabbing all the variables except ‘medv’ from the Boston data frame. We’ll also need to add a column of ones to our X matrix for the intercept term.

# Matrix of feature variables from Boston
X <- as.matrix(Boston[-ncol(Boston)])

# Vector of ones with same length as rows in Boston
int <- rep(1, length(y))

# Add intercept column to X
X <- cbind(int, X)

Now that we have our response vector and our ‘X’ matrix, we can use them to solve for the parameters using the following closed form solution:

$$\beta = (X^TX)^{-1}X^Ty$$

The derivation of this equation is beyond the scope of this post. If you are interested in the derivation, a good place to start is Wikipedia or any good undergraduate textbook on linear regression.

We can implement this in R using our ‘X’ matrix and ‘y’ vector. The `%*%` operator is simply matrix multiplication. The `t()` function takes the transpose of a matrix, and `solve()` calculates the inverse of any (invertible) matrix.

# Implement closed-form solution
betas <- solve(t(X) %*% X) %*% t(X) %*% y

# Round for easier viewing
betas <- round(betas, 2)

Now we have our parameter vector *β*. These are the linear relationships between the response variable ‘medv’ and each of the features. For example, we see that an increase of one unit in the number of rooms (rm) is associated with a $3,810 increase in home value. We can find the relationship between the response and any of the feature variables in the same manner, paying careful attention to the units in the data description. Notice that one of our features, ‘chas’, is a dummy variable which takes a value of 0 or 1 depending on whether or not the tract is adjacent to the Charles River. The coefficient of ‘chas’ tells us that homes in tracts adjacent to the Charles River (coded as 1) have a median price that is $2,690 higher than homes in tracts that do not border the river (coded as 0) when the other variables are held constant.

print(betas)
##            [,1]
## int       36.46
## crim      -0.11
## zn         0.05
## indus      0.02
## chas       2.69
## nox      -17.77
## rm         3.81
## age        0.00
## dis       -1.48
## rad        0.31
## tax       -0.01
## ptratio   -0.95
## black      0.01
## lstat     -0.52

Now let’s verify that these results are in fact correct. We want to compare our results to those produced by the `lm()` function. Most of you will already know how to do this.

# Linear regression model
lm.mod <- lm(medv ~ ., data=Boston)

# Round for easier viewing
lm.betas <- round(lm.mod$coefficients, 2)

# Create data.frame of results
results <- data.frame(our.results=betas, lm.results=lm.betas)
print(results)
##         our.results lm.results
## int           36.46      36.46
## crim          -0.11      -0.11
## zn             0.05       0.05
## indus          0.02       0.02
## chas           2.69       2.69
## nox          -17.77     -17.77
## rm             3.81       3.81
## age            0.00       0.00
## dis           -1.48      -1.48
## rad            0.31       0.31
## tax           -0.01      -0.01
## ptratio       -0.95      -0.95
## black          0.01       0.01
## lstat         -0.52      -0.52

The parameter estimates are identical! Now you know how to obtain parameter estimates on your own. Keep in mind that you’re unlikely to favor implementing linear regression in this way over using `lm()`. The `lm()` function is very quick, and requires very little code. Using it provides us with a number of diagnostic statistics, including \(R^2\), t-statistics, and the oft-maligned p-values, among others. It also solves for the parameters using QR decomposition, which is more robust than the method I’ve presented here. Nevertheless, I wanted to show one way in which it can be done.
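The QR route can itself be sketched in base R: `qr.solve()` computes the least-squares solution without explicitly forming \(X^TX\) or inverting it. This rebuilds the same ‘X’ and ‘y’ from the Boston data:

```r
library(MASS)

# Rebuild the response vector and design matrix with an intercept column
y <- Boston$medv
X <- cbind(int = 1, as.matrix(Boston[-ncol(Boston)]))

# Least squares via QR decomposition, avoiding the explicit matrix inverse
betas.qr <- qr.solve(X, y)
round(betas.qr["rm"], 2)  # 3.81, matching the closed-form result above
```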

There are other ways to solve for the parameters in linear regression. For example, gradient descent can be used to obtain parameter estimates when the number of features is extremely large, a situation that can drastically slow solution time when using the closed-form method. Stochastic gradient descent is often used when both the number of features and the number of samples are very large. I hope to detail these in a future post.
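As a rough illustration of the gradient-descent alternative, here is a minimal sketch on synthetic data; the learning rate and iteration count are arbitrary choices for this example, not tuned values:

```r
# Minimize mean squared error by gradient descent on a small synthetic problem
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
X <- cbind(1, x)

beta <- c(0, 0)  # starting guess
lr <- 0.1        # learning rate
for (i in 1:1000) {
  grad <- -2 * t(X) %*% (y - X %*% beta) / n  # gradient of the MSE
  beta <- beta - lr * grad
}
round(beta, 1)  # approximately the true values 2 and 3
```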

I have made the code from this post available at my Github here.

Cheers!

Related Post

- Regression model with auto correlated errors – Part 3, some astrology
- Regression model with auto correlated errors – Part 2, the models
- Regression model with auto correlated errors – Part 1, the data
- R for Publication by Page Piccinini: Lesson 5 – Analysis of Variance (ANOVA)
- R for Publication by Page Piccinini: Lesson 4 – Multiple Regression


The shiny app is available on my site, but even better, the code is on github for you to run locally or improve! Let me give you a quick tour of the app in this post. If you prefer, I have also posted a video that provides background on the app. Another tutorial on how to build interactive web apps with shiny is published at DataScience+.

**The available algorithms include:**

- Hierarchical Clustering (DMwR)
- Kmeans Euclidean Distance
- Kmeans Mahalanobis
- Kmeans Manhattan
- Fuzzy kmeans – Gustafson and Kessel
- Fuzzy k-medoids
- Fuzzy k-means with polynomial fuzzifier
- Local Outlier Factor (dbscan)
- RandomForest (proximity from randomForest)
- Isolation Forest (IsolationForest)
- Autoencoder (Autoencoder)
- FBOD and SOD (HighDimOut)

There are also a wide range of datasets to try as well. They include randomly scattered points, defined clusters, and some more unusual patterns like the smiley face and spirals. Additionally, you can use your mouse to add and/or remove points by clicking directly within the visualization. This allows you to create your own dataset.

Once the data is loaded, you can start exploring. One thing you can do is look at the effect scaling can have. In this example, you can see how outliers differ when scaling is used with kmeans. The values on the far right no longer dominate the distance measurements, and there are now outliers from other areas:
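The scaling effect is easy to reproduce outside the app. In this sketch (synthetic data, with a hypothetical `score()` helper), one column's large range dominates the raw Euclidean distances, and `scale()` changes which points look extreme:

```r
# Distance-to-centroid outlier scores with and without scaling
set.seed(1)
d <- cbind(a = rnorm(100), b = rnorm(100, sd = 1000))

# Euclidean distance from each point to its cluster centroid
score <- function(m) {
  km <- kmeans(m, centers = 1)
  sqrt(rowSums((m - km$centers[km$cluster, , drop = FALSE])^2))
}

head(order(score(d), decreasing = TRUE))        # driven almost entirely by column b
head(order(score(scale(d)), decreasing = TRUE)) # both columns now contribute
```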

It quickly becomes apparent that different algorithms may select different outliers. In this case, you see a difference between the outliers selected using an autoencoder versus isolation forest.

Another example here is the difference between chosen outliers using kmeans versus fuzzy kmeans:

A density-based algorithm can also select different outliers than a distance-based algorithm. This example nicely shows the difference between kmeans and LOF (local outlier factor from dbscan).
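This contrast can be sketched on toy data, assuming the `dbscan` package is installed; a global distance-to-centroid score and the LOF score are computed on the same points:

```r
# Contrast a global distance score with the density-based LOF score
library(dbscan)
set.seed(7)
pts <- rbind(matrix(rnorm(100), ncol = 2), c(8, 8))  # one cluster plus an outlier

# Global score: distance to the (single) kmeans centroid
km <- kmeans(pts, centers = 1)
dist.score <- sqrt(rowSums((pts - km$centers[km$cluster, , drop = FALSE])^2))

# Local score: local outlier factor from the dbscan package
lof.score <- lof(pts, minPts = 10)

which.max(dist.score)  # row 51, the added point
which.max(lof.score)   # LOF also flags it here, but the two can disagree
```

On data with clusters of differing density, the two rankings diverge much more sharply, which is the behavior the app lets you explore visually.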

An important part of using this visualization is studying the distance numbers that are calculated. Are these numbers meshing with your intuition? How big of a quantitative difference is there between outliers and other points?

The next question is whether to expand this to larger datasets. This is something that you would run locally (large datasets take too long to run on my shiny server). The downside of larger datasets is that it gets trickier to visualize them. For now, I am using a t-SNE plot. I am open to suggestions, but the intent here is to provide a way to evaluate outlier algorithms on a variety of datasets.

The source code for the outlier app is on github. The app is built off a variety of R packages and could easily be extended to new packages or incorporate additional datasets. Please send me bug fixes, additional algorithms, tighter code, or ideas for improving the app.
