Support vector classifiers are mainly used for binary classification problems, where we have only two class labels, say \(Y=[-1,1]\), and a set of predictors \(X_i\). What an SVM does is generate hyperplanes, which in simple terms are straight lines, planes, or non-linear curves, and these are used to separate the data into two or more categories, depending on the type of classification problem.
Another important concept in SVMs is the maximal margin classifier. Among all separating hyperplanes, the SVM aims to find the one that maximizes the margin \(M\), i.e. the gap between the two classes and the decision boundary (the separating plane). Being able to separate the data into two classes with a linear separator (a straight line) is called linear separability.
The support vectors in an SVM are the data points, or training examples, used to define and maximize the margin. They are the points that lie close to the decision boundary or on the wrong side of it.
However, linear separators and linear decision boundaries often fail because of non-linear interactions in the data and non-linear dependence between the features in feature space.
In this tutorial I am going to show how to generate non-linear decision boundaries that can separate non-linear data, using a radial kernel support vector classifier.
So how do we separate non-linear data?
The trick is feature expansion. We apply a non-linear transformation to the features \(X_i\), mapping them to a higher-dimensional space (say, from 2-D to 3-D) called a feature space. After this transformation, we can separate the non-linear data with a non-linear decision boundary.
One popular and very general way of adding non-linearities to a model, to capture non-linear interactions, is simply to include higher-degree terms such as squared and cubic polynomial terms:
$$\beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 + \beta_4 X_2^2 + \beta_5 X_2^3 + \dots = 0$$
The above equation describes the non-linear hyperplane generated when we fit higher-degree polynomial terms to the data to get a non-linear decision boundary.
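As a tiny illustration of why the enlarged space helps (a toy example of mine, not part of the tutorial's dataset): a class that depends on \(|x|\) cannot be split by any single threshold on \(x\), but becomes linearly separable once we add the squared term \(x^2\):

```r
x <- c(-2, -1.5, -0.5, 0, 0.5, 1.5, 2)
y <- ifelse(abs(x) > 1, 1, -1)  # class +1 outside [-1, 1], class -1 inside

# No threshold on x alone separates the classes, but in the enlarged
# feature space (x, x^2) the linear rule x^2 > 1 does:
all(y == ifelse(x^2 > 1, 1, -1))  # TRUE
```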
What we are actually doing is fitting an SVM in an enlarged feature space: we enlarge the space by fitting non-linear functions of the predictors \(X_i\). The problem with polynomials is that in higher dimensions, i.e. with many predictors, they get unwieldy and tend to overfit at higher degrees.
Hence there is a more elegant way of adding non-linearities to an SVM: the kernel trick.
A polynomial kernel function has the form
$$K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\right)^d,$$
where \(d\) is the degree of the polynomial.
The kernel we are going to use here is the radial kernel. It has the form
$$K(x_i, x_{i'}) = \exp\left(-\gamma \sum_{j=1}^{p}(x_{ij} - x_{i'j})^2\right),$$
where \(\gamma\) is a tuning parameter that controls the smoothness of the decision boundary and hence the variance of the model.
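To make the formula concrete, here is a minimal base-R sketch of the radial kernel (the function name rbf_kernel is mine, not from e1071):

```r
# Radial (RBF) kernel between two feature vectors x and z
rbf_kernel <- function(x, z, gamma = 0.5) {
  exp(-gamma * sum((x - z)^2))
}

x <- c(1, 2); z <- c(2, 0)
rbf_kernel(x, x)                      # 1: a point is maximally similar to itself
rbf_kernel(x, z) == rbf_kernel(z, x)  # TRUE: the kernel is symmetric
```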
If \(\gamma\) is very large, we get quite fluctuating and wiggly decision boundaries, which means high variance and overfitting.
If \(\gamma\) is small, the decision boundary is smoother and has lower variance.
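A quick numerical check of this behaviour (toy points of my choosing): for a fixed pair of points, increasing \(\gamma\) shrinks the kernel similarity, so only very close points influence each other and the fitted boundary can wiggle more:

```r
K <- function(x, z, gamma) exp(-gamma * sum((x - z)^2))
p <- c(0, 0); q <- c(1, 1)  # squared distance 2

sapply(c(0.1, 1, 10), function(g) K(p, q, g))
# similarity decays from ~0.82 towards 0 as gamma grows
```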
So the equation of the support vector classifier becomes
$$f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i),$$
where \(S\) is the set of support vectors and \(\alpha_i\) is a weight that is non-zero for the support vectors and zero otherwise.
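To see how this decision function is evaluated, here is a toy sketch (the support vectors, \(\alpha_i\) values and \(\beta_0\) below are made up for illustration, not fitted to any data):

```r
rbf <- function(x, z, gamma = 0.5) exp(-gamma * sum((x - z)^2))

support_vectors <- list(c(0, 0), c(1, 1), c(2, 0))  # hypothetical support vectors
alpha <- c(0.7, -1.2, 0.5)                          # hypothetical weights
beta0 <- 0.1

# f(x) = beta0 + sum over support vectors of alpha_i * K(x, x_i)
f <- function(x) beta0 + sum(mapply(function(sv, a) a * rbf(x, sv),
                                    support_vectors, alpha))

sign(f(c(0.2, 0.1)))  # the sign of f(x) gives the predicted class
```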
I will use the ‘e1071’ package to implement Radial SVM in R.
require(e1071)
require(ElemStatLearn) #package containing the dataset

#Loading the data
attach(mixture.example) #a simulated mixture dataset with 200 rows and 2 classes
names(mixture.example)
## [1] "x"        "y"        "xnew"     "prob"     "marginal" "px1"
## [7] "px2"      "means"
The data is 2-D, so let's plot it. The two colors represent the two classes.
plot(x,col=y+3)
#converting data to a data frame
data <- data.frame(y = factor(y), x)
head(data)
##   y          X1         X2
## 1 0  2.52609297  0.3210504
## 2 0  0.36695447  0.0314621
## 3 0  0.76821908  0.7174862
## 4 0  0.69343568  0.7771940
## 5 0 -0.01983662  0.8672537
## 6 0  2.19654493 -1.0230141
Now let's fit a radial kernel SVM using the svm() function.
Radialsvm <- svm(factor(y) ~ ., data = data, kernel = "radial", cost = 5, scale = F)
Radialsvm
## 
## Call:
## svm(formula = factor(y) ~ ., data = data, kernel = "radial", 
##     cost = 5, scale = F)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  5 
##       gamma:  0.5 
## 
## Number of Support Vectors:  103
The output shows the values of the tuning parameters \(C\) and \(\gamma\) and the number of support vectors; the fitted model defines a non-linear decision boundary.
Let's generate a confusion matrix to check the model's performance.
#Confusion matrix to check the accuracy
table(predicted = Radialsvm$fitted, actual = data$y)
##          actual
## predicted  0  1
##         0 76 10
##         1 24 90
Let's calculate the misclassification rate:
#misclassification rate
mean(Radialsvm$fitted != data$y) * 100 #17% wrong predictions
## [1] 17
Now let's create a grid, make predictions on those grid values, and generate a plot with the decision boundary.
xgrid <- expand.grid(X1 = px1, X2 = px2) #generating grid points
ygrid <- predict(Radialsvm, newdata = xgrid) #ygrid holds the predicted response values

#let's plot the non-linear decision boundary
plot(xgrid, col = as.numeric(ygrid), pch = 20, cex = 0.3)
points(x, col = y + 1, pch = 19) #we can see that the decision boundary is non-linear
We can improve the above plot by actually drawing the decision boundary, using the contour() function.
The code below is a little confusing, so I urge readers to consult the help documentation of the functions used:
func <- predict(Radialsvm, xgrid, decision.values = TRUE)
func <- attributes(func)$decision #pull out the attributes and use the decision values

#let's again plot the decision boundary
plot(xgrid, col = as.numeric(ygrid), pch = 20, cex = 0.3)
points(x, col = y + 1, pch = 19)
?contour #to read more on using this function
contour(px1, px2, matrix(func, 69, 99), level = 0, add = TRUE, lwd = 3) #adds the fitted non-linear decision boundary
contour(px1, px2, matrix(prob, 69, 99), level = 0.5, add = TRUE, col = "blue", lwd = 3) #the true (Bayes) decision boundary
legend("topright", c("True Decision Boundary", "Fitted Decision Boundary"), lwd = 3, col = c("blue", "black"))
Gives output :
The above plot shows the true Bayes decision boundary alongside the fitted decision boundary that the radial kernel learned from the data. They look quite similar, and it seems that the SVM has produced a good approximation of the true underlying function.
A radial kernel support vector machine is a good approach when the data is not linearly separable. The idea behind generating non-linear decision boundaries is to apply a non-linear transformation to the features \(X_i\), mapping them to a higher-dimensional space; the kernel trick lets us do this transformation implicitly. The performance of the SVM is influenced by its two tuning parameters: the regularization parameter \(C\) and \(\gamma\). We can use cross-validation to find the values of these hyper-parameters that give our classifier \(C(X)\) the best performance. Another way of finding good values is to use an optimization technique such as Bayesian optimization.
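A hedged sketch of the grid-search idea (the candidate values below are arbitrary choices of mine; with the data frame from this tutorial, e1071's tune.svm() would run the cross-validation over such a grid):

```r
# Candidate values for the two tuning parameters
grid <- expand.grid(cost = c(0.1, 1, 5, 10), gamma = c(0.1, 0.5, 1, 2))
nrow(grid)  # 16 (cost, gamma) pairs to evaluate

# With the tutorial's data frame, the cross-validation call would look like:
# tuned <- tune.svm(y ~ ., data = data, kernel = "radial",
#                   cost = c(0.1, 1, 5, 10), gamma = c(0.1, 0.5, 1, 2))
# tuned$best.parameters
```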
Hope you guys liked the article and make sure to like and share it with your peers.
I am sure this is motivating enough to get you started generating your own support vector classifiers and using their power to produce non-linear decision boundaries and learn non-linear interactions from data.
The idea is that instead of producing a single complex model, which might have high variance (leading to overfitting) or be too simple and have high bias (leading to underfitting), we generate lots of models on the training set and combine them at the end. Random Forest is one such popular ensembling technique, used to improve the predictive performance of decision trees by averaging the trees to reduce their variance. Decision trees are simple, easily interpretable and understandable modelling techniques, but a major drawback is their poor predictive performance and poor generalization to the test set. They are often called weak learners: learners that perform only somewhat better than chance, with an error just below 50%.
Random forests are similar to the famous ensemble technique called bagging, but with a different tweak. In random forests, the idea is to decorrelate the trees that are grown on different bootstrapped samples of the training data, and then reduce the variance by averaging them.
Averaging the trees reduces the variance, improves the performance of decision trees on the test set, and eventually helps avoid overfitting.
The idea is to build many trees in such a way that the correlation between the trees is smaller.
Another major difference is that we only consider a random subset of \( m \) predictors each time we make a split, whereas in ordinary trees we examine all predictors at each split and choose the best among them. Typically \( m=\sqrt{p} \), where \(p\) is the number of predictors.
It may seem crazy to throw away lots of predictors, but it makes sense: the effect is that each tree uses different predictors to split the data at different points. Two trees generated on the same training data will have randomly different variables selected at each split; this is how the trees get de-correlated and become more independent of each other. Another great thing about random forests and bagging is that we can keep adding more big, bushy trees without hurting performance, because at the end we simply average them, which reduces the variance by a factor of the number of trees \(T\).
So by throwing away predictors at each split we de-correlate the trees, and the resulting average is a little better.
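The variance-reduction claim can be checked with a quick base-R simulation (toy numbers of mine, not the housing data): averaging \(T\) independent predictions, each with standard deviation \(\sigma\), leaves the average with variance \(\sigma^2/T\):

```r
set.seed(1)
T_trees <- 100
sigma <- 2

# variance of the average of T independent "tree" predictions
sims <- replicate(10000, mean(rnorm(T_trees, sd = sigma)))
var(sims)  # close to sigma^2 / T = 4 / 100 = 0.04
```

With correlated trees the reduction is weaker, which is exactly why random forests try to de-correlate them.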
Loading the required packages:
require(randomForest)
require(MASS) #package which contains the Boston housing dataset
attach(Boston)
set.seed(101)
dim(Boston)
## [1] 506  14
#training sample with 300 observations
train <- sample(1:nrow(Boston), 300)
?Boston #to read more on the dataset
We are going to use the variable 'medv', the median housing value, as the response. We will fit 500 trees and use all the predictors in the dataset.
Boston.rf <- randomForest(medv ~ ., data = Boston, subset = train)
Boston.rf
## 
## Call:
##  randomForest(formula = medv ~ ., data = Boston, subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 12.62686
##                     % Var explained: 84.74
The above mean squared error and variance explained are calculated using out-of-bag (OOB) error estimation: each tree is fit on a bootstrap sample containing about \(\frac23\) of the training data, and the remaining \( \frac13 \) is used to validate that tree. Also, the number of variables randomly selected at each split is 4.
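The \(\frac13\) figure comes from the bootstrap itself (a quick check of mine, not part of the tutorial's output): the chance that a given observation is absent from one bootstrap sample of size \(n\) is \((1-1/n)^n\), which approaches \(e^{-1}\approx 0.368\):

```r
n <- 300                 # size of our training sample
p_oob <- (1 - 1/n)^n     # probability an observation is out of bag for one tree
round(p_oob, 3)          # ~0.367, close to exp(-1)
```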
Plotting the Error vs Number of Trees Graph.
plot(Boston.rf)
This plot shows the error against the number of trees. We can easily see how the error drops as we keep adding more trees and averaging them.
The above random forest model chose 4 variables at random to be considered at each split. We could now try all possible values, from 1 to all 13 predictors at each split.
oob.err <- double(13)
test.err <- double(13)

#mtry is the number of variables randomly chosen at each split
for(mtry in 1:13) {
  rf <- randomForest(medv ~ ., data = Boston, subset = train, mtry = mtry, ntree = 400)
  oob.err[mtry] <- rf$mse[400] #error of all trees fitted

  pred <- predict(rf, Boston[-train,]) #predictions on the test set for each tree
  test.err[mtry] <- with(Boston[-train,], mean((medv - pred)^2)) #mean squared test error

  cat(mtry, " ") #printing the output to the console
}
## 1  2  3  4  5  6  7  8  9  10  11  12  13
Test Error
test.err
##  [1] 26.06433 17.70018 16.51951 14.94621 14.51686 14.64315 14.60834
##  [8] 15.12250 14.42441 14.53687 14.89362 14.86470 15.09553
Out of Bag Error Estimation
oob.err
##  [1] 19.95114 13.34894 13.27162 12.44081 12.75080 12.96327 13.54794
##  [8] 13.68273 14.16359 14.52294 14.41576 14.69038 14.72979
What happens is that we grow 400 trees 13 times, once for each value of mtry from 1 to 13 predictors.
matplot(1:mtry, cbind(oob.err, test.err), pch = 19, col = c("red", "blue"),
        type = "b", ylab = "Mean Squared Error",
        xlab = "Number of Predictors Considered at each Split")
legend("topright", legend = c("Out of Bag Error", "Test Error"),
       pch = 19, col = c("red", "blue"))
The red line shows the out-of-bag error estimates and the blue line the error calculated on the test set. Both curves are quite smooth, and the two error estimates are somewhat correlated. The error tends to be minimized at around \( mtry = 4 \).
At the extreme right-hand side of the above plot, we consider all 13 predictors at each split, which is simply bagging.
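A hedged sketch of fitting this bagging special case directly (assuming the same Boston setup as above; setting mtry to all 13 predictors means every split considers every variable):

```r
require(randomForest)
require(MASS)
set.seed(101)
train <- sample(1:nrow(Boston), 300)

# mtry = ncol(Boston) - 1 = 13: consider all predictors at each split (bagging)
bag.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                           mtry = 13, ntree = 400)
```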
In this article I gave a simple overview of random forests, how they differ from other ensemble learning techniques, and how to implement this complex and strong modelling technique in R with the simple randomForest package. Random forests are a very nice technique for fitting a more accurate model by averaging lots of decision trees, reducing the variance, and avoiding the overfitting problem of trees. Decision trees themselves perform poorly, but when used with ensembling techniques like bagging and random forests, their predictive performance improves a lot. There are, of course, various other R packages that implement random forests.
I hope this tutorial is enough to get you started implementing random forests in R, or at least to understand the basic idea behind how this amazing technique works.
Thanks for reading the article and make sure to like and share it.
Not surprisingly, we learn that House Stark (specifically Ned and Sansa) and House Lannister (especially Tyrion) are the most important family connections in Game of Thrones; they also connect many of the storylines and are central parts of the narrative.
A network in this context is a graph of interconnected nodes/vertices. Nodes can e.g. be people in a social network, genes in a co-expression network, etc. Nodes are connected via ties/edges.
Network analysis can, for example, be used to explore relationships in social or professional networks. In such cases, we would typically ask questions about who interacts with whom and who the key players are.
These answers can give us a lot of information about the patterns of how people interact.
The basis for this network is Kaggle’s Game of Throne dataset (character-deaths.csv). Because most family relationships were missing in that dataset, I added the missing information in part by hand (based on A Wiki of Ice and Fire) and by scraping information from the Game of Thrones wiki. You can find the full code for how I generated the network on my Github page.
library(tidyverse)
library(igraph)
library(statnet)

load("union_edges.RData")
load("union_characters.RData")
I am using igraph to plot the initial network. To do so, I first create the graph from the edge- and node-table. An edge-table contains source and target nodes in the first two columns and optionally additional columns with edge attributes. Here, I have the type of interaction (mother, father or spouse), the color and line-type I want to assign to each edge.
Because the books and the TV series differ slightly, I have introduced edges that are only supported or hinted at by the TV series and are not part of the original narrative in the books. These edges are marked by being dotted instead of solid. An additional color for edges with unspecified parental origin are introduced as well. Originally, these served for interactions that were extracted from character names (i.e. characters that ended with “… son/daughter of …”) and could either mean mother or father. Now, they show unclear parentage or cases where there are a biological and a de facto father, as in the case of Jon Snow.
head(union_edges)
##               source            target   type   color   lty
## 1         Lysa Arryn      Robert Arryn mother #7570B3 solid
## 2       Jasper Arryn        Alys Arryn father #1B9E77 solid
## 3       Jasper Arryn         Jon Arryn father #1B9E77 solid
## 4          Jon Arryn      Robert Arryn father #1B9E77 solid
## 110 Cersei Lannister  Tommen Baratheon mother #7570B3 solid
## 210 Cersei Lannister Joffrey Baratheon mother #7570B3 solid
The node table contains one row for each character that appears as a source or a target in the edge table. We can add any number and type of node attributes. Here, I chose the following columns from the original Kaggle dataset: gender/male (male = 1, female = 0), house (the house each character was born into) and popularity. house2 was meant to assign a color to the major houses only; shape represents gender.
head(union_characters)
##            name male culture          house popularity      house2   color  shape
## 1    Alys Arryn    0             House Arryn 0.08026756                     circle
## 2 Elys Waynwood    0          House Waynwood 0.07023411                     circle
## 3  Jasper Arryn    1             House Arryn 0.04347826                     square
## 4   Jeyne Royce    0             House Royce 0.00000000                     circle
## 5     Jon Arryn    1 Valemen     House Arryn 0.83612040                     square
## 6    Lysa Arryn    0             House Tully 0.00000000 House Tully #F781BF circle
By default, we have a directed graph.
union_graph <- graph_from_data_frame(union_edges, directed = TRUE, vertices = union_characters)
For plotting the legend, I am summarizing the edge and node colors.
color_vertices <- union_characters %>%
  group_by(house, color) %>%
  summarise(n = n()) %>%
  filter(!is.na(color))

colors_edges <- union_edges %>%
  group_by(type, color) %>%
  summarise(n = n()) %>%
  filter(!is.na(color))
Now, we can plot the graph object (here with Fruchterman-Reingold layout):
layout <- layout_with_fr(union_graph)
Click on the image to get to the high resolution pdf:
plot(union_graph, layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph)$name),
     vertex.shape = V(union_graph)$shape,
     vertex.color = V(union_graph)$color,
     vertex.size = (V(union_graph)$popularity + 0.5) * 5,
     vertex.frame.color = "gray",
     vertex.label.color = "black",
     vertex.label.cex = 0.8,
     edge.arrow.size = 0.5,
     edge.color = E(union_graph)$color,
     edge.lty = E(union_graph)$lty)
legend("topleft", legend = c(NA, "Node color:", as.character(color_vertices$house), NA, "Edge color:", as.character(colors_edges$type)),
       pch = 19, col = c(NA, NA, color_vertices$color, NA, NA, colors_edges$color),
       pt.cex = 5, cex = 2, bty = "n", ncol = 1, title = "")
legend("topleft", legend = "", cex = 4, bty = "n", ncol = 1,
       title = "Game of Thrones Family Ties")
Node color shows the major houses, node size the character’s popularity and node shape their gender (square for male, circle for female). Edge color shows interaction type.
As we can see, even with only a subset of characters from the Game of Thrones world, the network is already quite big. You can click on the image to open the pdf and zoom into specific parts of the plot and read the node labels/character names.
What we can see right away is that there are only limited connections between houses and that the Greyjoys are the only house that has no ties to any of the others.
How do we find out who the most important characters are in this network?
We consider a character "important" if he or she has connections to many other characters. There are a few network properties that tell us more about this. For this analysis, I treat the network as undirected, to account for parent/child relationships being mutual.
union_graph_undir <- as.undirected(union_graph, mode = "collapse")
Centrality describes the number of edges that are in- or outgoing to/from nodes. High centrality networks have few nodes with many connections, low centrality networks have many nodes with similar numbers of edges.
For the whole network, we can calculate centrality by degree (centr_degree()), closeness (centr_clo()) or eigenvector centrality (centr_eigen()) of vertices.
centr_degree(union_graph_undir, mode = "total")$centralization
## [1] 0.04282795
centr_clo(union_graph_undir, mode = "total")$centralization
## [1] 0.01414082
centr_eigen(union_graph_undir, directed = FALSE)$centralization
## [1] 0.8787532
Node degree or degree centrality describes how central a node is in the network (i.e. how many in- and outgoing edges it has or to how many other nodes it is directly connected via one edge).
union_graph_undir_degree <- igraph::degree(union_graph_undir, mode = "total")

#standardized by number of nodes
union_graph_undir_degree_std <- union_graph_undir_degree / (vcount(union_graph_undir) - 1)

node_degree <- data.frame(degree = union_graph_undir_degree,
                          degree_std = union_graph_undir_degree_std) %>%
  tibble::rownames_to_column()

union_characters <- left_join(union_characters, node_degree, by = c("name" = "rowname"))

node_degree %>%
  arrange(-degree) %>%
  .[1:10, ]
##            rowname degree degree_std
## 1  Quellon Greyjoy     12 0.05797101
## 2      Walder Frey     10 0.04830918
## 3   Oberyn Martell     10 0.04830918
## 4     Eddard Stark      9 0.04347826
## 5    Catelyn Stark      8 0.03864734
## 6       Emmon Frey      7 0.03381643
## 7  Genna Lannister      7 0.03381643
## 8     Merrett Frey      7 0.03381643
## 9    Balon Greyjoy      7 0.03381643
## 10 Jason Lannister      7 0.03381643
In this case, the node degree reflects how many offspring and spouses a character had. With three wives and several children, Quellon Greyjoy, the grandfather of Theon and Asha/Yara, comes out on top (of course, had I included all the offspring and wives of Walder Frey, he would easily be on top, but the network would have become infinitely more confusing).
The closeness of a node describes its distance to all other nodes. A node with highest closeness is more central and can spread information to many other nodes.
closeness <- igraph::closeness(union_graph_undir, mode = "total")

#standardized by number of nodes
closeness_std <- closeness / (vcount(union_graph_undir) - 1)

node_closeness <- data.frame(closeness = closeness,
                             closeness_std = closeness_std) %>%
  tibble::rownames_to_column()

union_characters <- left_join(union_characters, node_closeness, by = c("name" = "rowname"))

node_closeness %>%
  arrange(-closeness) %>%
  .[1:10, ]
##             rowname    closeness closeness_std
## 1       Sansa Stark 0.0002013288  9.726028e-07
## 2  Tyrion Lannister 0.0002012882  9.724070e-07
## 3   Tywin Lannister 0.0002011668  9.718201e-07
## 4  Joanna Lannister 0.0002005616  9.688965e-07
## 5      Eddard Stark 0.0002002804  9.675381e-07
## 6     Catelyn Stark 0.0001986492  9.596579e-07
## 7  Cersei Lannister 0.0001984915  9.588960e-07
## 8   Jaime Lannister 0.0001975894  9.545382e-07
## 9    Jeyne Marbrand 0.0001966568  9.500330e-07
## 10  Tytos Lannister 0.0001966568  9.500330e-07
The characters with highest closeness all surround central characters that connect various storylines and houses in Game of Thrones.
Betweenness describes the number of shortest paths between nodes. Nodes with high betweenness centrality are on the path between many other nodes, i.e. they are people who are key connections or bridges between different groups of nodes. In a social network, these nodes would be very important because they are likely to pass on information to a wide reach of people.
The igraph function betweenness() calculates vertex betweenness, edge_betweenness() calculates edge betweenness:
betweenness <- igraph::betweenness(union_graph_undir, directed = FALSE)

# standardize by number of node pairs
betweenness_std <- betweenness / ((vcount(union_graph_undir) - 1) * (vcount(union_graph_undir) - 2) / 2)

node_betweenness <- data.frame(betweenness = betweenness,
                               betweenness_std = betweenness_std) %>%
  tibble::rownames_to_column()

union_characters <- left_join(union_characters, node_betweenness, by = c("name" = "rowname"))

node_betweenness %>%
  arrange(-betweenness) %>%
  .[1:10, ]
##              rowname betweenness betweenness_std
## 1       Eddard Stark    6926.864       0.3248846
## 2        Sansa Stark    6165.667       0.2891828
## 3   Tyrion Lannister    5617.482       0.2634718
## 4    Tywin Lannister    5070.395       0.2378123
## 5   Joanna Lannister    4737.524       0.2221999
## 6  Rhaegar Targaryen    4301.583       0.2017533
## 7    Margaery Tyrell    4016.417       0.1883784
## 8           Jon Snow    3558.884       0.1669192
## 9        Mace Tyrell    3392.500       0.1591154
## 10   Jason Lannister    3068.500       0.1439191
edge_betweenness <- igraph::edge_betweenness(union_graph_undir, directed = FALSE)

data.frame(edge = attributes(E(union_graph_undir))$vnames,
           betweenness = edge_betweenness) %>%
  tibble::rownames_to_column() %>%
  arrange(-betweenness) %>%
  .[1:10, ]
##    rowname                              edge betweenness
## 1      160      Sansa Stark|Tyrion Lannister    5604.149
## 2      207          Sansa Stark|Eddard Stark    4709.852
## 3      212        Rhaegar Targaryen|Jon Snow    3560.083
## 4      296       Margaery Tyrell|Mace Tyrell    3465.000
## 5      213             Eddard Stark|Jon Snow    3163.048
## 6      131  Jason Lannister|Joanna Lannister    3089.500
## 7      159 Joanna Lannister|Tyrion Lannister    2983.591
## 8      171  Tyrion Lannister|Tywin Lannister    2647.224
## 9      192    Elia Martell|Rhaegar Targaryen    2580.000
## 10     300         Luthor Tyrell|Mace Tyrell    2565.000
This, we can now plot by feeding the node betweenness as vertex.size and edge betweenness as edge.width to our plot function:
plot(union_graph_undir, layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph_undir)$name),
     vertex.shape = V(union_graph_undir)$shape,
     vertex.color = V(union_graph_undir)$color,
     vertex.size = betweenness * 0.001,
     vertex.frame.color = "gray",
     vertex.label.color = "black",
     vertex.label.cex = 0.8,
     edge.width = edge_betweenness * 0.01,
     edge.arrow.size = 0.5,
     edge.color = E(union_graph_undir)$color,
     edge.lty = E(union_graph_undir)$lty)
legend("topleft", legend = c("Node color:", as.character(color_vertices$house), NA, "Edge color:", as.character(colors_edges$type)),
       pch = 19, col = c(NA, color_vertices$color, NA, NA, colors_edges$color),
       pt.cex = 5, cex = 2, bty = "n", ncol = 1)
Ned Stark is the character with highest betweenness. This makes sense, as he and his children (specifically Sansa and her arranged marriage to Tyrion) connect to other houses and are the central points from which the story unfolds. However, we have to keep in mind here, that my choice of who is important enough to include in the network (e.g. the Stark ancestors) and who not (e.g. the whole complicated mess that is the Targaryen and Frey family tree) makes this result somewhat biased.
In contrast to the shortest path between two nodes, we can also calculate the longest path, or diameter:
diameter(union_graph_undir, directed = FALSE)
## [1] 21
In our network, the longest path connects 21 nodes.
This, we can also plot:
union_graph_undir_diameter <- union_graph_undir
node_diameter <- get.diameter(union_graph_undir_diameter, directed = FALSE)

V(union_graph_undir_diameter)$color <- scales::alpha(V(union_graph_undir_diameter)$color, alpha = 0.5)
V(union_graph_undir_diameter)$size <- 2
V(union_graph_undir_diameter)[node_diameter]$color <- "red"
V(union_graph_undir_diameter)[node_diameter]$size <- 5

E(union_graph_undir_diameter)$color <- "grey"
E(union_graph_undir_diameter)$width <- 1
E(union_graph_undir_diameter, path = node_diameter)$color <- "red"
E(union_graph_undir_diameter, path = node_diameter)$width <- 5

plot(union_graph_undir_diameter, layout = layout,
     vertex.label = gsub(" ", "\n", V(union_graph_undir_diameter)$name),
     vertex.shape = V(union_graph_undir_diameter)$shape,
     vertex.frame.color = "gray",
     vertex.label.color = "black",
     vertex.label.cex = 0.8,
     edge.arrow.size = 0.5,
     edge.lty = E(union_graph_undir_diameter)$lty)
legend("topleft", legend = c("Node color:", as.character(color_vertices$house), NA, "Edge color:", as.character(colors_edges$type)),
       pch = 19, col = c(NA, color_vertices$color, NA, NA, colors_edges$color),
       pt.cex = 5, cex = 2, bty = "n", ncol = 1)
“Transitivity measures the probability that the adjacent vertices of a vertex are connected. This is sometimes also called the clustering coefficient.” transitivity() help. We can calculate the transitivity or ratio of triangles to connected triples for the whole network:
transitivity(union_graph_undir, type = "global")
## [1] 0.2850679
Or for each node:
transitivity <- data.frame(name = V(union_graph_undir)$name,
                           transitivity = transitivity(union_graph_undir, type = "local")) %>%
  mutate(name = as.character(name))

union_characters <- left_join(union_characters, transitivity, by = "name")

transitivity %>%
  arrange(-transitivity) %>%
  .[1:10, ]
##                 name transitivity
## 1       Robert Arryn            1
## 2   Ormund Baratheon            1
## 3     Selyse Florent            1
## 4  Shireen Baratheon            1
## 5   Amarei Crakehall            1
## 6       Marissa Frey            1
## 7        Olyvar Frey            1
## 8        Perra Royce            1
## 9        Perwyn Frey            1
## 10         Tion Frey            1
Because ours is a family network, characters with a transitivity of one form triangles with their parents or offspring.
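As a sanity check of the definition (toy graphs of mine, not the GoT network): in a triangle every pair of neighbours is connected, so the clustering coefficient is exactly 1, while a path graph contains no triangles at all:

```r
library(igraph)

triangle <- make_full_graph(3)            # three nodes, all connected
transitivity(triangle, type = "global")   # 1: every connected triple closes

path_graph <- make_graph(~ A - B - C)     # B's neighbours A and C are not connected
transitivity(path_graph, type = "global") # 0: no triangles
```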
PageRank (originally used by Google to rank the importance of search results) is similar to eigenvector centrality. Eigenvector centrality scores nodes in a network according to the number of connections to high-degree nodes they have. It is therefore a measure of node importance. PageRank similarly considers nodes as more important if they have many incoming edges (or links).
page_rank <- page.rank(union_graph_undir, directed = FALSE)

page_rank_centrality <- data.frame(name = names(page_rank$vector),
                                   page_rank = page_rank$vector) %>%
  mutate(name = as.character(name))

union_characters <- left_join(union_characters, page_rank_centrality, by = "name")

page_rank_centrality %>%
  arrange(-page_rank) %>%
  .[1:10, ]
##                 name   page_rank
## 1     Oberyn Martell 0.018402407
## 2    Quellon Greyjoy 0.016128129
## 3        Walder Frey 0.012956029
## 4       Eddard Stark 0.011725019
## 5       Cregan Stark 0.010983561
## 6      Catelyn Stark 0.010555473
## 7       Lyarra Stark 0.009876629
## 8  Aegon V Targaryen 0.009688458
## 9      Balon Greyjoy 0.009647049
## 10         Jon Arryn 0.009623742
Oberyn Martell, Quellon Greyjoy and Walder Frey all have the highest numbers of spouses, children and grandchildren, and are therefore scored highest for PageRank.
Connections between nodes can also be represented as an adjacency matrix. We can convert our graph object to an adjacency matrix with igraph’s as_adjacency_matrix() function. Whenever there is an edge between two nodes, this field in the matrix will get assigned a 1, otherwise it is 0.
adjacency <- as.matrix(as_adjacency_matrix(union_graph_undir))
We can now calculate the eigenvalues and eigenvectors of the adjacency matrix.
#degree diagonal matrix
degree_diag <- diag(1 / igraph::degree(union_graph_undir))

# PageRank matrix
pagerank <- adjacency %*% degree_diag

eigenvalues <- eigen(pagerank)
The eigenvector with the highest eigenvalue scores highly those vertices that have many edges or that are connected to vertices with many edges.
eigenvector <- data.frame(name = rownames(pagerank),
                          eigenvector = as.numeric(eigenvalues$vectors[, which.max(eigenvalues$values)]))

union_characters <- left_join(union_characters, eigenvector, by = "name")

eigenvector %>%
  arrange(eigenvector) %>%
  .[1:10, ]
##                       name eigenvector
## 1          Quellon Greyjoy  -0.6625628
## 2            Balon Greyjoy  -0.3864950
## 3   Lady of House Sunderly  -0.3312814
## 4           Alannys Harlaw  -0.2760678
## 5  Lady of House Stonetree  -0.2208543
## 6      Asha (Yara) Greyjoy  -0.1656407
## 7            Robin Greyjoy  -0.1104271
## 8            Euron Greyjoy  -0.1104271
## 9          Urrigon Greyjoy  -0.1104271
## 10       Victarion Greyjoy  -0.1104271
Because of their highly connected family ties (i.e. there are only a handful of connections but they are almost all triangles), the Greyjoys have been scored with the highest eigenvalues.
We can find the eigenvector centrality scores with:
eigen_centrality <- igraph::eigen_centrality(union_graph_undir, directed = FALSE)

eigen_centrality <- data.frame(name = names(eigen_centrality$vector),
                               eigen_centrality = eigen_centrality$vector) %>%
  mutate(name = as.character(name))

union_characters <- left_join(union_characters, eigen_centrality, by = "name")

eigen_centrality %>%
  arrange(-eigen_centrality) %>%
  .[1:10, ]
##                name eigen_centrality
## 1   Tywin Lannister        1.0000000
## 2  Cersei Lannister        0.9168980
## 3  Joanna Lannister        0.8358122
## 4    Jeyne Marbrand        0.8190076
## 5   Tytos Lannister        0.8190076
## 6   Genna Lannister        0.7788376
## 7   Jaime Lannister        0.7642870
## 8  Robert Baratheon        0.7087042
## 9        Emmon Frey        0.6538709
## 10      Walder Frey        0.6516021
When we consider eigenvector centrality, Tywin and the core Lannister family score highest.
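eigen_centrality() solves an eigenproblem on the raw adjacency matrix; the idea can be sketched with power iteration in Python/numpy (again on a hypothetical toy graph, rescaled as igraph does so that the maximum score is 1):

```python
import numpy as np

# Same hypothetical 4-node undirected graph as before (not the GoT network)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)

# Power iteration: repeatedly multiply by A and renormalise; for a connected,
# non-bipartite graph this converges to the leading (Perron) eigenvector
x = np.ones(A.shape[0])
for _ in range(200):
    x = A @ x
    x = x / np.linalg.norm(x)

# igraph rescales the scores so that the maximum is 1
centrality = x / x.max()
```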
We can now compare all the node-level information to decide which characters are the most important in Game of Thrones. Such node level characteristics could also be used as input for machine learning algorithms.
Let’s look at all characters from the major houses:
union_characters %>% filter(!is.na(house2)) %>% dplyr::select(-contains("_std")) %>% gather(x, y, degree:eigen_centrality) %>% ggplot(aes(x = name, y = y, color = house2)) + geom_point(size = 3) + facet_grid(x ~ house2, scales = "free") + theme_bw() + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
Gives this plot:
Taken together, we could say that House Stark (specifically Ned and Sansa) and House Lannister (especially Tyrion) are the most important family connections in Game of Thrones.
We can also analyze dyads (pairs of two nodes), triads (groups of three nodes) and bigger cliques in our network. For dyads, we can use the function dyad_census() from igraph or dyad.census() from sna. Both are identical and calculate a Holland and Leinhardt dyad census with
#igraph::dyad_census(union_graph_undir)
sna::dyad.census(adjacency)
## Mut Asym Null
## [1,] 326 0 21202
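The census itself is straightforward to compute from the adjacency matrix. Here is an illustrative Python sketch on a small hypothetical directed graph (note that in our undirected union graph every tie is mutual by construction, which is why Asym = 0 above):

```python
import numpy as np
from itertools import combinations

# Hypothetical directed adjacency matrix (3 nodes)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 0, 0]])

mut = asym = null = 0
for i, j in combinations(range(A.shape[0]), 2):
    if A[i, j] and A[j, i]:
        mut += 1     # mutual: ties in both directions
    elif A[i, j] or A[j, i]:
        asym += 1    # asymmetric: tie in one direction only
    else:
        null += 1    # null: no tie at all
```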
The same can be calculated for triads (see ?triad_census for details on what each output means).
#igraph::triad_census(union_graph_undir)
sna::triad.census(adjacency)
## 003 012 102 021D 021U 021C 111D 111U 030T 030C 201 120D 120U 120C 210 300
## [1,] 1412100 0 65261 0 0 0 0 0 0 0 790 0 0 0 0 105
triad.classify(adjacency, mode = "graph")
## [1] 2
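For an undirected graph, the complete triads (type 300) are simply triangles, and those can be counted directly from the adjacency matrix: trace(A³) counts closed walks of length 3, and every triangle contributes six of them (3 starting nodes times 2 directions). A Python sketch on a hypothetical graph:

```python
import numpy as np

# Hypothetical undirected graph: one triangle {0, 1, 2} plus a pendant node 3
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=int)

# Each triangle is counted 6 times among the closed walks of length 3
triangles = np.trace(np.linalg.matrix_power(A, 3)) // 6
```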
We can also calculate the number of paths and cycles of any length we specify, here e.g. of length <= 5. For edges, we obtain the sum of counts for all paths or cycles up to the given maximum length. For vertices/nodes, we obtain the number of paths or cycles to which each node belongs.
node_kpath <- kpath.census(adjacency, maxlen = 5, mode = "graph", tabulate.by.vertex = TRUE, dyadic.tabulation = "sum") edge_kpath <- kpath.census(adjacency, maxlen = 5, mode = "graph", tabulate.by.vertex = FALSE) edge_kpath ## $path.count ## 1 2 3 4 5 ## 326 1105 2973 7183 17026
We could plot this (though here it does not add much information):
gplot(node_kpath$paths.bydyad, label.cex = 0.5, vertex.cex = 0.75, displaylabels = TRUE, edge.col = "grey") node_kcycle <- kcycle.census(adjacency, maxlen = 8, mode = "graph", tabulate.by.vertex = TRUE, cycle.comembership = "sum") edge_kcycle <- kcycle.census(adjacency, maxlen = 8, mode = "graph", tabulate.by.vertex = FALSE) edge_kcycle node_kcycle_reduced <- node_kcycle$cycle.comemb node_kcycle_reduced <- node_kcycle_reduced[which(rowSums(node_kcycle_reduced) > 0), which(colSums(node_kcycle_reduced) > 0)] gplot(node_kcycle_reduced, label.cex = 0.5, vertex.cex = 0.75, displaylabels = TRUE, edge.col = "grey")
“A (maximal) clique is a maximal set of mutually adjacent vertices.” clique.census() help
node_clique <- clique.census(adjacency, mode = "graph", tabulate.by.vertex = TRUE, clique.comembership = "sum")
node_clique_reduced <- node_clique$clique.comemb
node_clique_reduced <- node_clique_reduced[which(rowSums(node_clique_reduced) > 0), which(colSums(node_clique_reduced) > 0)]
gplot(node_clique_reduced, label.cex = 0.5, vertex.cex = 0.75, displaylabels = TRUE, edge.col = "grey")
The largest cliques in this network contain three nodes, i.e. the parent/child relationships. Therefore, it does not really make sense to plot them all, but we could plot and color them with:
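Clique enumeration can be illustrated with a brute-force Python sketch on a hypothetical 4-node graph with one triangle (production implementations such as sna's and igraph's use smarter algorithms, e.g. Bron-Kerbosch):

```python
import numpy as np
from itertools import combinations

# Hypothetical undirected graph: triangle {0, 1, 2} plus edge 2-3
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
n = A.shape[0]

def is_clique(nodes):
    """True if every pair of the given nodes is connected."""
    return all(A[i, j] for i, j in combinations(nodes, 2))

# Enumerate all cliques, then keep only the maximal ones
cliques = [frozenset(c) for k in range(1, n + 1)
           for c in combinations(range(n), k) if is_clique(c)]
maximal = [c for c in cliques if not any(c < d for d in cliques)]
largest = max(maximal, key=len)
```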
vcol <- rep("grey80", vcount(union_graph_undir)) # highlight first of largest cliques vcol[unlist(largest_cliques(union_graph_undir)[[1]])] <- "red" plot(union_graph_undir, layout = layout, vertex.label = gsub(" ", "\n", V(union_graph_undir)$name), vertex.shape = V(union_graph_undir)$shape, vertex.color = vcol, vertex.size = 5, vertex.frame.color = "gray", vertex.label.color = "black", vertex.label.cex = 0.8, edge.width = 2, edge.arrow.size = 0.5, edge.color = E(union_graph_undir)$color, edge.lty = E(union_graph_undir)$lty)
We can also look for groups within our network by clustering node groups according to their edge betweenness:
ceb <- cluster_edge_betweenness(union_graph_undir) plot(ceb, union_graph_undir, layout = layout, vertex.label = gsub(" ", "\n", V(union_graph_undir)$name), vertex.shape = V(union_graph_undir)$shape, vertex.size = (V(union_graph_undir)$popularity + 0.5) * 5, vertex.frame.color = "gray", vertex.label.color = "black", vertex.label.cex = 0.8)
Or based on propagating labels:
clp <- cluster_label_prop(union_graph_undir) plot(clp, union_graph_undir, layout = layout, vertex.label = gsub(" ", "\n", V(union_graph_undir)$name), vertex.shape = V(union_graph_undir)$shape, vertex.size = (V(union_graph_undir)$popularity + 0.5) * 5, vertex.frame.color = "gray", vertex.label.color = "black", vertex.label.cex = 0.8)
We can also feed our adjacency matrix to other functions, like GenInd() from the NetIndices package. This function calculates a number of network properties, like the number of compartments (N), total system throughput (T..), total system throughflow (TST), number of internal links (Lint), total number of links (Ltot), link density (LD), connectance (C), average link weight (Tijbar), average compartment throughflow (TSTbar) and compartmentalization or degree of connectedness of subsystems in the network (Cbar).
library(NetIndices) graph.properties <- GenInd(adjacency) graph.properties
Alternatively, the network package provides additional functions to obtain network properties. Here, we can again feed in the adjacency matrix of our network and convert it to a network object.
library(network) adj_network <- network(adjacency, directed = TRUE) adj_network ## Network attributes: ## vertices = 208 ## directed = TRUE ## hyper = FALSE ## loops = FALSE ## multiple = FALSE ## bipartite = FALSE ## total edges= 652 ## missing edges= 0 ## non-missing edges= 652 ## ## Vertex attribute names: ## vertex.names ## ## No edge attributes
From this network object, we can e.g. get the number of dyads and edges within a network and the network size.
network.dyadcount(adj_network) network.edgecount(adj_network) network.size(adj_network) ## [1] 43056 ## [1] 652 ## [1] 208
“equiv.clust uses a definition of approximate equivalence (equiv.fun) to form a hierarchical clustering of network positions. Where dat consists of multiple relations, all specified relations are considered jointly in forming the equivalence clustering.” equiv.clust() help
ec <- equiv.clust(adj_network, mode = "graph", cluster.method = "average", plabels = network.vertex.names(adj_network)) ec ## Position Clustering: ## ## Equivalence function: sedist ## Equivalence metric: hamming ## Cluster method: average ## Graph order: 208
ec$cluster$labels <- ec$plabels plot(ec)
From the sna package, we can e.g. use functions that tell us the graph density and the dyadic reciprocity of the vertices or edges:
gden(adjacency) grecip(adjacency) grecip(adjacency, measure = "edgewise") ## [1] 0.01514307 ## Mut ## 1 ## Mut ## 1
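Both measures reduce to simple ratios over the adjacency matrix, as this Python sketch on a hypothetical directed graph shows: density is the number of observed arcs over the n(n-1) possible arcs, and edgewise reciprocity is the fraction of arcs that are reciprocated.

```python
import numpy as np

# Hypothetical directed adjacency matrix (3 nodes)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 0, 0]])
n = A.shape[0]

# gden-style density: observed arcs / possible arcs
density = A.sum() / (n * (n - 1))

# edgewise reciprocity: share of arcs whose reverse arc also exists
reciprocated = (A * A.T).sum()
edgewise_recip = reciprocated / A.sum()
```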
sessionInfo() ## R version 3.4.0 (2017-04-21) ## Platform: x86_64-w64-mingw32/x64 (64-bit) ## Running under: Windows 7 x64 (build 7601) Service Pack 1 ## ## Matrix products: default ## ## locale: ## [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C LC_TIME=English_United States.1252 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] NetIndices_1.4.4 MASS_7.3-47 statnet_2016.9 sna_2.4 ergm.count_3.2.2 tergm_3.4.0 networkDynamic_0.9.0 ergm_3.7.1 network_1.13.0 statnet.common_3.3.0 igraph_1.0.1 dplyr_0.5.0 purrr_0.2.2 readr_1.1.0 tidyr_0.6.2 tibble_1.3.0 ggplot2_2.2.1 tidyverse_1.1.1 ## ## loaded via a namespace (and not attached): ## [1] lpSolve_5.6.13 reshape2_1.4.2 haven_1.0.0 lattice_0.20-35 colorspace_1.3-2 htmltools_0.3.6 yaml_2.1.14 foreign_0.8-68 DBI_0.6-1 modelr_0.1.0 readxl_1.0.0 trust_0.1-7 plyr_1.8.4 robustbase_0.92-7 stringr_1.2.0 munsell_0.4.3 gtable_0.2.0 cellranger_1.1.0 rvest_0.3.2 coda_0.19-1 psych_1.7.5 evaluate_0.10 labeling_0.3 knitr_1.15.1 forcats_0.2.0 parallel_3.4.0 DEoptimR_1.0-8 broom_0.4.2 Rcpp_0.12.10 scales_0.4.1 backports_1.0.5 jsonlite_1.4 mnormt_1.5-5 hms_0.3 digest_0.6.12 stringi_1.1.5 grid_3.4.0 rprojroot_1.2 tools_3.4.0 magrittr_1.5 lazyeval_0.2.0 Matrix_1.2-10 xml2_1.1.1 lubridate_1.6.0 assertthat_0.2.0 rmarkdown_1.5 httr_1.2.1 R6_2.2.0 nlme_3.1-131 compiler_3.4.0
If you have questions feel free to post a comment.
]]>suppressPackageStartupMessages(library(strucchange)) suppressPackageStartupMessages(library(fUnitRoots)) suppressPackageStartupMessages(library(astsa))
We are going to do a basic exploration of the globtemp time series as available within the astsa package. The globtemp dataset reports the deviation (in degrees centigrade) from the 1951-1980 global mean land-ocean temperature. Let us have a look at it.
data("globtemp") globtemp Time Series: Start = 1880 End = 2015 Frequency = 1 [1] -0.20 -0.11 -0.10 -0.20 -0.28 -0.31 -0.30 -0.33 -0.20 -0.11 -0.37 -0.24 -0.27 [14] -0.30 -0.31 -0.22 -0.15 -0.11 -0.28 -0.16 -0.09 -0.15 -0.28 -0.36 -0.45 -0.28 [27] -0.23 -0.40 -0.44 -0.47 -0.43 -0.44 -0.35 -0.35 -0.16 -0.11 -0.33 -0.40 -0.26 [40] -0.23 -0.26 -0.21 -0.27 -0.24 -0.28 -0.20 -0.09 -0.20 -0.21 -0.36 -0.13 -0.09 [53] -0.17 -0.28 -0.13 -0.19 -0.15 -0.02 -0.02 -0.03 0.08 0.13 0.10 0.14 0.26 [66] 0.12 -0.03 -0.04 -0.09 -0.09 -0.17 -0.06 0.01 0.08 -0.12 -0.14 -0.20 0.03 [79] 0.06 0.03 -0.03 0.05 0.02 0.06 -0.20 -0.10 -0.05 -0.02 -0.07 0.07 0.03 [92] -0.09 0.01 0.15 -0.08 -0.01 -0.11 0.18 0.07 0.16 0.27 0.32 0.13 0.31 [105] 0.16 0.12 0.19 0.33 0.40 0.28 0.44 0.42 0.23 0.24 0.32 0.46 0.34 [118] 0.48 0.63 0.42 0.42 0.55 0.63 0.62 0.55 0.69 0.63 0.66 0.54 0.64 [131] 0.72 0.60 0.63 0.66 0.75 0.87
summary(globtemp) Min. 1st Qu. Median Mean 3rd Qu. Max. -0.47000 -0.21000 -0.07500 0.01838 0.18250 0.87000
plot(globtemp) grid()
We can see a remarkable increase in the temperature deviations over the last decades. The globtemp time series appears to be non-stationary, basically due to the upward trend of the last decades. A plot of globtemp against its smoothed fit may help in understanding this better.
tt <- 1:length(globtemp) fit <- ts(loess(globtemp ~ tt, span = 0.2)$fitted, start = 1880, frequency = 1) plot(globtemp, type='l') lines(fit, col = 4) grid()
The globtemp density plot is herein shown.
plot(density(globtemp), main = "globtemp density plot")
We are going to run the Augmented Dickey-Fuller test with type = “ct” having the following models, null-hypothesis and test statistics.
$$
\begin{equation}
\begin{cases}
\ Model:\ \Delta y_{t}\ =\ a_{0}+ \gamma y_{t-1} + a_{2}t\ + \epsilon_{t} \\
\\
H_{0}: \gamma = 0\ \ \ \ \ test\ statistics: \tau_{3} \\
\\
H_{0}: \gamma = a_{2} = 0\ \ \ \ \ test\ statistics: \phi_{3} \\
\\
H_{0}: \ a_{0} = \gamma = a_{2} = 0\ \ \ \ \ test\ statistics: \phi_{2} \\
\end{cases}
\end{equation}
$$
See ref. [3] for details about the Dickey-Fuller test and its report as output by the urdfTest() function within the fUnitRoots package.
urdftest_lag = floor(12*(length(globtemp)/100)^0.25) # long urdfTest(globtemp, lags = urdftest_lag, type = c("ct"), doplot = FALSE) Title: Augmented Dickey-Fuller Unit Root Test Test Results: Test regression trend Call: lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag) Residuals: Min 1Q Median 3Q Max -0.233454 -0.061206 0.000907 0.067984 0.196946 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.066984 0.049114 -1.364 0.17546 z.lag.1 -0.104138 0.086354 -1.206 0.23048 tt 0.001213 0.000651 1.863 0.06517 . z.diff.lag1 -0.273952 0.119332 -2.296 0.02362 * z.diff.lag2 -0.324109 0.120821 -2.683 0.00845 ** z.diff.lag3 -0.263840 0.122262 -2.158 0.03314 * z.diff.lag4 -0.029046 0.123184 -0.236 0.81404 z.diff.lag5 -0.164546 0.124910 -1.317 0.19052 z.diff.lag6 -0.121042 0.124239 -0.974 0.33210 z.diff.lag7 -0.063802 0.122548 -0.521 0.60369 z.diff.lag8 0.043304 0.119004 0.364 0.71665 z.diff.lag9 -0.088910 0.116985 -0.760 0.44891 z.diff.lag10 0.071604 0.111461 0.642 0.52197 z.diff.lag11 -0.049646 0.102584 -0.484 0.62940 z.diff.lag12 -0.037257 0.095916 -0.388 0.69846 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.09935 on 108 degrees of freedom Multiple R-squared: 0.2657, Adjusted R-squared: 0.1705 F-statistic: 2.791 on 14 and 108 DF, p-value: 0.001403 Value of test-statistic is: -1.2059 2.8535 2.3462 Critical values for test statistics: 1pct 5pct 10pct tau3 -3.99 -3.43 -3.13 phi2 6.22 4.75 4.07 phi3 8.43 6.49 5.47
By comparing the test statistics with the critical values at the 5% significance level, we cannot reject any of the null hypotheses. As a consequence, the unit root hypothesis cannot be rejected. Since, in the presence of structural breaks, the Dickey-Fuller test is biased towards non-rejection of the null hypothesis (ref. [3]), we run the KPSS test having the trend stationarity hypothesis as null (i.e. deterministic trend with stationary residuals).
urkpssTest(globtemp, type = c("tau"), lags = c("long"), doplot = FALSE) Title: KPSS Unit Root Test Test Results: Test is of type: tau with 12 lags. Value of test-statistic is: 0.2114 Critical value for a significance level of: 10pct 5pct 2.5pct 1pct critical values 0.119 0.146 0.176 0.216
Since the test statistic (0.2114) exceeds the critical value at the 5% significance level (0.146), we reject the trend stationarity hypothesis.
Bai and Perron established a general methodology for estimating breakpoints and their associated confidence intervals in OLS regression, employing dynamic programming. In that way, it is possible to find m breakpoints that minimize the residual sum of squares (RSS) associated with a model of m+1 segments, given some minimal segment size of h·n observations. The bandwidth parameter h is chosen by the user, typically equal to 0.1 or 0.15. Since the number of breakpoints m is not known in advance, it is necessary to compute the optimal breakpoints for m = 0, 1, … breaks and choose the model that minimizes some information criterion such as the BIC (ref. [1]). That model selection strategy is available within the breakpoints() function of the strucchange R package (ref. [2]).
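To make the idea concrete, here is a toy dynamic-programming sketch in Python, under strong simplifying assumptions (mean-only segments, a minimum segment length of 2, and synthetic data with a single level shift); breakpoints() implements the general regression case far more efficiently.

```python
import numpy as np
from functools import lru_cache

# Hypothetical toy series with a single level shift at index 5
y = np.array([0.0, 0.1, -0.1, 0.0, 0.05, 1.0, 1.1, 0.9, 1.0, 1.05])
n, min_len = len(y), 2

def seg_rss(i, j):
    """RSS of fitting a constant (the segment mean) to y[i:j]."""
    seg = y[i:j]
    return float(((seg - seg.mean()) ** 2).sum())

@lru_cache(maxsize=None)
def best(i, m):
    """Minimal RSS for splitting y[i:] into m+1 mean segments."""
    if m == 0:
        return seg_rss(i, n), ()
    candidates = []
    for b in range(i + min_len, n - min_len * m + 1):
        sub_rss, sub_breaks = best(b, m - 1)
        candidates.append((seg_rss(i, b) + sub_rss, (b,) + sub_breaks))
    return min(candidates)

rss, breaks = best(0, 1)   # one break, i.e. two segments
```

On this toy series the optimal single break lands at index 5, exactly where the level shift was placed.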
In the following, we determine the globtemp time series structural change dates, if any. Such an analysis is known as “dating structural changes (breaks)”. Specifically, we are looking for:
* level structural changes
* trend structural changes
* polynomial fit structural changes
* auto-regressive model structural changes
Level structural changes can be determined with the help of the following formula:
globtemp ~ 1
We remark that any structural change analysis should be run against regressors which are significant in terms of the time series fit. For that purpose we run:
summary(lm(globtemp ~ 1)) Call: lm(formula = globtemp ~ 1) Residuals: Min 1Q Median 3Q Max -0.48838 -0.22838 -0.09338 0.16412 0.85162 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.01838 0.02721 0.676 0.5 Residual standard error: 0.3173 on 135 degrees of freedom
The intercept coefficient is not significant, likely due to the upward trend observable in the last decades. We therefore shorten our time series and run the same regression again.
globtemp_win <- window(globtemp, end = 2000) lev_fit <- lm(globtemp_win ~ 1) summary(lev_fit) Call: lm(formula = globtemp_win ~ 1) Residuals: Min 1Q Median 3Q Max -0.41017 -0.17017 -0.04017 0.13983 0.68983 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.05983 0.02161 -2.769 0.00652 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.2377 on 120 degrees of freedom
The intercept coefficient is now reported as significant. Let us plot the time series against the fit.
plot(globtemp_win) lines(ts(fitted(lev_fit), start = 1880, frequency = 1), col = 4)
We go on with the search for structural changes.
globtemp_brk <- breakpoints(globtemp_win ~ 1, h = 0.1) summary(globtemp_brk) Optimal (m+1)-segment partition: Call: breakpoints.formula(formula = globtemp_win ~ 1, h = 0.1) Breakpoints at observation number: m = 1 97 m = 2 57 100 m = 3 57 97 109 m = 4 22 34 57 100 m = 5 22 34 57 97 109 m = 6 22 34 57 69 97 109 m = 7 22 34 57 69 81 97 109 m = 8 22 34 46 58 70 85 97 109 m = 9 12 24 36 48 60 72 84 97 109 Corresponding to breakdates: m = 1 1976 m = 2 1936 1979 m = 3 1936 1976 1988 m = 4 1901 1913 1936 1979 m = 5 1901 1913 1936 1976 1988 m = 6 1901 1913 1936 1948 1976 1988 m = 7 1901 1913 1936 1948 1960 1976 1988 m = 8 1901 1913 1925 1937 1949 1964 1976 1988 m = 9 1891 1903 1915 1927 1939 1951 1963 1976 1988 Fit: m 0 1 2 3 4 5 6 7 8 9 RSS 6.7814 2.7965 1.3933 1.2582 1.1600 1.0249 0.9663 0.9606 0.9760 1.1440 BIC 4.3002 -93.2918 -168.0043 -170.7502 -170.9914 -176.3777 -173.9182 -165.0385 -153.5225 -124.7135
Above are the results of finding m = 1..9 breakpoints with the associated dates and {RSS, BIC} metrics. The minimum value of the BIC is reached for m = 5. We plot the breakpoints() function output to get a visual understanding of it.
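As a sanity check, the BIC column can be reproduced from the RSS column. The Python sketch below uses the Gaussian BIC including the full log-likelihood constant, which appears to be the convention behind the numbers above (n = 121 observations in globtemp_win, and 2m+2 parameters for m breaks):

```python
import numpy as np

# RSS values reported by breakpoints() for m = 0..9 breaks (from the table above)
rss = np.array([6.7814, 2.7965, 1.3933, 1.2582, 1.1600,
                1.0249, 0.9663, 0.9606, 0.9760, 1.1440])
n = 121                  # observations in globtemp_win (1880-2000)
m = np.arange(10)        # number of breaks
k = 2 * m + 2            # m breakdates + (m+1) segment means + error variance

# Gaussian BIC with the full log-likelihood constant
bic = n * (np.log(rss / n) + 1 + np.log(2 * np.pi)) + k * np.log(n)
best_m = int(np.argmin(bic))
```

This reproduces the BIC values in the table above to rounding, with the minimum at m = 5.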
plot(globtemp_brk)
The plot of the observed and fitted time series, along with confidence intervals for the breakpoints, is given by:
plot(globtemp_win) lines(fitted(globtemp_brk, breaks = 5), col = 4) lines(confint(globtemp_brk, breaks = 5))
The break dates are:
breakdates(globtemp_brk, breaks = 5) [1] 1901 1913 1936 1976 1988
Level breaks coefficients:
coef(globtemp_brk, breaks = 5) (Intercept) 1880 - 1901 -0.2177273 1902 - 1913 -0.3733333 1914 - 1936 -0.2152174 1937 - 1976 -0.0085000 1977 - 1988 0.2200000 1989 - 2000 0.3900000
Trend structural changes can be determined with the help of the following formula:
globtemp ~ tt
where tt is the globtemp timeline (detailed below). Again, we first have to verify that such a regression makes sense for our time series by evaluating the coefficients' significance.
l <- length(globtemp) tt <- 1:l trend_fit <- lm(globtemp ~ tt) summary(trend_fit) Call: lm(formula = globtemp ~ tt) Residuals: Min 1Q Median 3Q Max -0.33363 -0.11470 -0.02466 0.11932 0.38017 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.4600523 0.0273468 -16.82 <2e-16 *** tt 0.0069844 0.0003464 20.16 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.1586 on 134 degrees of freedom Multiple R-squared: 0.7521, Adjusted R-squared: 0.7503 F-statistic: 406.6 on 1 and 134 DF, p-value: < 2.2e-16
Both the intercept and slope coefficients are reported as significant. Let us plot the time series against the fit.
plot(globtemp) lines(ts(fitted(trend_fit), start=1880, frequency = 1), col = 4)
We go on with the search for structural changes.
globtemp_brk <- breakpoints(globtemp ~ tt, h = 0.1) summary(globtemp_brk) Optimal (m+1)-segment partition: Call: breakpoints.formula(formula = globtemp ~ tt, h = 0.1) Breakpoints at observation number: m = 1 84 m = 2 23 84 m = 3 27 66 84 m = 4 23 53 66 84 m = 5 16 32 53 66 84 m = 6 16 32 53 66 84 97 m = 7 16 32 53 66 84 97 117 m = 8 16 32 53 66 84 97 110 123 m = 9 14 27 40 53 66 84 97 110 123 Corresponding to breakdates: m = 1 1963 m = 2 1902 1963 m = 3 1906 1945 1963 m = 4 1902 1932 1945 1963 m = 5 1895 1911 1932 1945 1963 m = 6 1895 1911 1932 1945 1963 1976 m = 7 1895 1911 1932 1945 1963 1976 1996 m = 8 1895 1911 1932 1945 1963 1976 1989 2002 m = 9 1893 1906 1919 1932 1945 1963 1976 1989 2002 Fit: m 0 1 2 3 4 5 6 7 8 9 RSS 3.3697 1.6670 1.2446 1.0202 0.8907 0.8229 0.7975 0.7739 0.7889 0.8282 BIC -102.2140 -183.1946 -208.1954 -220.4943 -224.2281 -220.2494 -209.7707 -199.1243 -181.7788 -160.4341
plot(globtemp_brk)
The BIC minimum value is reached for m = 4.
plot(globtemp) lines(fitted(globtemp_brk, breaks = 4), col = 4) lines(confint(globtemp_brk, breaks = 4))
Break dates:
breakdates(globtemp_brk, breaks = 4) [1] 1902 1932 1945 1963
Trend breaks coefficients:
coef(globtemp_brk, breaks = 4) (Intercept) tt 1880 - 1902 -0.2312253 0.0008992095 1903 - 1932 -0.6037597 0.0084093437 1933 - 1945 -2.2475824 0.0374725275 1946 - 1963 -0.5640454 0.0070072239 1964 - 2015 -1.5983672 0.0173520874
Second-degree polynomial structural changes can be determined with the help of the following formula:
globtemp ~ tt + I(tt^2)
where tt is the globtemp timeline. Once again, we first have to verify that such a regression makes sense for our time series by evaluating the coefficients' significance.
pol_fit <- lm(globtemp ~ tt + I(tt^2)) summary(pol_fit) Call: lm(formula = globtemp ~ tt + I(tt^2)) Residuals: Min 1Q Median 3Q Max -0.27014 -0.08379 0.00685 0.07299 0.38625 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -2.124e-01 3.015e-02 -7.045 9.02e-11 *** tt -3.784e-03 1.016e-03 -3.725 0.000288 *** I(tt^2) 7.860e-05 7.183e-06 10.943 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.1155 on 133 degrees of freedom Multiple R-squared: 0.8696, Adjusted R-squared: 0.8676 F-statistic: 443.3 on 2 and 133 DF, p-value: < 2.2e-16
All coefficients are reported as significant. Let us plot the time series against the fit.
plot(globtemp, type = 'l') lines(ts(fitted(pol_fit), start = 1880, frequency = 1), col = 4)
We go on with the search for structural changes.
globtemp_brk <- breakpoints(globtemp ~ tt + I(tt^2), data = globtemp, h = 0.1) summary(globtemp_brk) Optimal (m+1)-segment partition: Call: breakpoints.formula(formula = globtemp ~ tt + I(tt^2), h = 0.1, data = globtemp) Breakpoints at observation number: m = 1 66 m = 2 22 66 m = 3 22 66 84 m = 4 20 36 66 84 m = 5 20 36 66 84 97 m = 6 20 36 53 66 84 97 m = 7 20 36 53 66 84 97 112 m = 8 20 36 53 66 84 97 110 123 m = 9 19 32 45 58 71 84 97 110 123 Corresponding to breakdates: m = 1 1945 m = 2 1901 1945 m = 3 1901 1945 1963 m = 4 1899 1915 1945 1963 m = 5 1899 1915 1945 1963 1976 m = 6 1899 1915 1932 1945 1963 1976 m = 7 1899 1915 1932 1945 1963 1976 1991 m = 8 1899 1915 1932 1945 1963 1976 1989 2002 m = 9 1898 1911 1924 1937 1950 1963 1976 1989 2002 Fit: m 0 1 2 3 4 5 6 7 8 9 RSS 1.7732 1.1662 0.9977 0.8903 0.7985 0.7348 0.6883 0.6453 0.6491 0.7024 BIC -184.6168 -221.9545 -223.5269 -219.3667 -214.5125 -206.1761 -195.4235 -184.5463 -164.0878 -133.7079
plot(globtemp_brk)
The BIC minimum value is reached for m = 2.
plot(globtemp) lines(fitted(globtemp_brk, breaks = 2), col = 4) lines(confint(globtemp_brk, breaks = 2))
Break dates:
breakdates(globtemp_brk, breaks = 2) [1] 1901 1945
Polynomial fit breaks coefficients:
coef(globtemp_brk, breaks = 2) (Intercept) tt I(tt^2) 1880 - 1901 -0.11675325 -0.02862084 0.0013226990 1902 - 1945 -0.06025445 -0.02034837 0.0003589594 1946 - 2015 0.61164924 -0.02197550 0.0001724349
First differencing of the globtemp time series is computed to make it stationary. The result is zero-centered by subtracting its mean.
diff_globtemp <- diff(globtemp) - mean(diff(globtemp)) plot(diff_globtemp, type = 'l') grid()
We run the Augmented Dickey-Fuller test with type = “nc”, with the following model, null hypothesis and test statistic.
$$
\begin{equation}
\begin{cases}
\ Model:\ \Delta y_{t}\ =\ \gamma y_{t-1} + \epsilon_{t} \\
\\
H_{0}: \gamma = 0\ \ \ \ \ test\ statistics: \tau_{1} \\
\\
\end{cases}
\end{equation}
$$
urdftest_lag = floor(12*(length(diff_globtemp)/100)^0.25) # long urdfTest(diff_globtemp, lags = urdftest_lag, type = c("nc"), doplot = FALSE) Title: Augmented Dickey-Fuller Unit Root Test Test Results: Test regression none Call: lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag) Residuals: Min 1Q Median 3Q Max -0.21075 -0.06373 0.00499 0.07173 0.18976 Coefficients: Estimate Std. Error t value Pr(>|t|) z.lag.1 -2.69424 0.69716 -3.865 0.000189 *** z.diff.lag1 1.35360 0.67167 2.015 0.046336 * z.diff.lag2 0.98047 0.64080 1.530 0.128899 z.diff.lag3 0.69130 0.60389 1.145 0.254820 z.diff.lag4 0.64252 0.56115 1.145 0.254718 z.diff.lag5 0.47524 0.51691 0.919 0.359925 z.diff.lag6 0.34398 0.46422 0.741 0.460287 z.diff.lag7 0.26961 0.40719 0.662 0.509299 z.diff.lag8 0.29206 0.34707 0.842 0.401906 z.diff.lag9 0.20160 0.28928 0.697 0.487350 z.diff.lag10 0.25317 0.22206 1.140 0.256761 z.diff.lag11 0.17314 0.15559 1.113 0.268243 z.diff.lag12 0.10922 0.09352 1.168 0.245440 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.1003 on 109 degrees of freedom Multiple R-squared: 0.6922, Adjusted R-squared: 0.6555 F-statistic: 18.86 on 13 and 109 DF, p-value: < 2.2e-16 Value of test-statistic is: -3.8646 Critical values for test statistics: 1pct 5pct 10pct tau1 -2.58 -1.95 -1.62
By comparing the test statistics with the critical value at 5% significance level, we reject the unit root null-hypothesis. Further, we run the KPSS test with type = “mu” to test the null hypothesis of level stationarity.
urkpssTest(diff_globtemp, type = c("mu"), lags = c("long"), doplot = FALSE) Title: KPSS Unit Root Test Test Results: Test is of type: mu with 12 lags. Value of test-statistic is: 0.3308 Critical value for a significance level of: 10pct 5pct 2.5pct 1pct critical values 0.347 0.463 0.574 0.739
Based on the reported test statistic and critical values, we cannot reject the null hypothesis of level stationarity. We then evaluate a linear regression model having lag-1 and lag-2 as regressors to fit the current value.
lag_1 <- lag(diff_globtemp, -1) lag_2 <- lag(diff_globtemp, -2) globtemp_df <- ts.intersect(dd0 = diff_globtemp, dd1 = lag_1, dd2 = lag_2) summary(lm(dd0 ~ dd1 + dd2 - 1, data = globtemp_df)) Call: lm(formula = dd0 ~ dd1 + dd2 - 1, data = globtemp_df) Residuals: Min 1Q Median 3Q Max -0.268417 -0.073502 0.007569 0.073164 0.266005 Coefficients: Estimate Std. Error t value Pr(>|t|) dd1 -0.30442 0.08463 -3.597 0.000455 *** dd2 -0.27040 0.08463 -3.195 0.001752 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.1029 on 131 degrees of freedom Multiple R-squared: 0.1246, Adjusted R-squared: 0.1112 F-statistic: 9.323 on 2 and 131 DF, p-value: 0.0001639
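The no-intercept lag regression above is ordinary least squares on two lagged copies of the series. A Python sketch on a synthetic AR(2) series (hypothetical data, not the globtemp differences) with coefficients close to those estimated above:

```python
import numpy as np

# Synthetic AR(2) series (hypothetical, not the globtemp differences)
rng = np.random.default_rng(0)
phi1, phi2 = -0.30, -0.27
eps = rng.normal(scale=0.1, size=300)
x = np.zeros(300)
for t in range(2, 300):
    x[t] = phi1 * x[t - 1] + phi2 * x[t - 2] + eps[t]

# Regression dd0 ~ dd1 + dd2 - 1 (no intercept), as in the R code
dd0 = x[2:]
X = np.column_stack([x[1:-1], x[:-2]])   # lag-1 and lag-2 columns
coef, *_ = np.linalg.lstsq(X, dd0, rcond=None)
```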
Both lag-1 and lag-2 coefficients are reported as significant. Then we inspect if any structural change occurs for our auto-regressive model.
dd_brk <- breakpoints(dd0 ~ dd1 + dd2 - 1, data = globtemp_df, h = 0.1) summary(dd_brk) Optimal (m+1)-segment partition: Call: breakpoints.formula(formula = dd0 ~ dd1 + dd2 - 1, h = 0.1, data = globtemp_df) Breakpoints at observation number: m = 1 78 m = 2 78 102 m = 3 64 78 102 m = 4 19 36 78 102 m = 5 17 35 63 78 102 m = 6 17 35 63 78 102 120 m = 7 17 35 48 64 78 102 120 m = 8 17 35 48 64 78 93 106 119 m = 9 13 26 39 54 67 80 93 106 119 Corresponding to breakdates: m = 1 1960 m = 2 1960 1984 m = 3 1946 1960 1984 m = 4 1901 1918 1960 1984 m = 5 1899 1917 1945 1960 1984 m = 6 1899 1917 1945 1960 1984 2002 m = 7 1899 1917 1930 1946 1960 1984 2002 m = 8 1899 1917 1930 1946 1960 1975 1988 2001 m = 9 1895 1908 1921 1936 1949 1962 1975 1988 2001 Fit: m 0 1 2 3 4 5 6 7 8 9 RSS 1.387 1.338 1.290 1.270 1.250 1.223 1.212 1.202 1.198 1.221 BIC -214.826 -204.918 -195.122 -182.486 -169.973 -158.211 -144.707 -131.090 -116.950 -99.757
plot(dd_brk)
In this scenario, we cannot settle on a value of m, as the BIC is minimum for m = 0. Ref. [1] shows a similar example and points out that the BIC was found by Bai and Perron to be somewhat unreliable for auto-regressive models. We therefore try to identify structural changes relying on an F statistics test, also available within the strucchange package.
globtemp_Fstats <- Fstats(dd0 ~ dd1 + dd2 - 1, data = globtemp_df, from = 0.05, to = 0.95) plot(globtemp_Fstats)
No boundary crossing can be spotted (the boundary is represented by the red horizontal line on the top of the plot).
sctest(globtemp_Fstats, type = "supF") supF test data: globtemp_Fstats sup.F = 4.7036, p-value = 0.7706
The sctest p-value confirms there is no significant breakpoint. We then conclude there are no structural changes in the auto-regressive model under analysis.
After a basic exploration of the globtemp dataset, we leveraged functions available within the strucchange package to date structural changes. Level, trend and second-degree polynomial fit breaks were identified. In contrast, no breaks were found in the auto-regressive model based on lag-1 and lag-2 regressors. We further remark that the strucchange package provides additional features such as generalized fluctuation tests, as described in ref. [2].
If you have any questions, please feel free to comment below.
Disclaimer
References
In this post we will scrape data from the Web using the rvest package and generate beautiful maps using the ggplot2 and maps packages in R. A similar post was published earlier at DataScience+.
require(rvest) #rvest is the package to scrape Web pages in R
# ?rvest to search more on this package
require(ggplot2)
require(dplyr)
require(scales)
require(maps)
Now we scrape data from the Wikipedia article and convert it to an R data frame.
#Loading the Data obesity<-read_html("https://en.wikipedia.org/wiki/Obesity_in_the_United_States") ?read_html #for knowing more on this function #html_nodes() to select a particular HTML element from the above page #Converting to a R dataframe #xpath of the Wikipedia table data obesity = obesity %>% html_nodes(xpath='//*[@id="mw-content-text"]/div/table[2]') %>% .[[1]] %>% html_table(fill=T) head(obesity) ## State and District of Columbia Obese adults ## 1 Alabama 30.1% ## 2 Alaska 27.3% ## 3 Arizona 23.3% ## 4 Arkansas 28.1% ## 5 California 23.1% ## 6 Colorado 21.0% ## Overweight (incl. obese) adults Obese children and adolescents ## 1 65.4% 16.7% ## 2 64.5% 11.1% ## 3 59.5% 12.2% ## 4 64.7% 16.4% ## 5 59.4% 13.2% ## 6 55.0% 9.9% ## Obesity rank ## 1 3 ## 2 14 ## 3 40 ## 4 9 ## 5 41 ## 6 51
Data Transformation
#to check the structure of the data str(obesity) ## 'data.frame': 51 obs. of 5 variables: ## $ State and District of Columbia : chr "Alabama" "Alaska" "Arizona" "Arkansas" ... ## $ Obese adults : chr "30.1%" "27.3%" "23.3%" "28.1%" ... ## $ Overweight (incl. obese) adults: chr "65.4%" "64.5%" "59.5%" "64.7%" ... ## $ Obese children and adolescents : chr "16.7%" "11.1%" "12.2%" "16.4%" ... ## $ Obesity rank : int 3 14 40 9 41 51 49 43 22 39 ...
We need to remove the ‘%’ from the data and convert it to a numeric data type in order to draw plots using the ggplot2 package.
#removing the % and making the data numeric for(i in 2:4){ obesity[,i] = gsub("%", "", obesity[,i]) obesity[,i] = as.numeric(obesity[,i]) } str(obesity) ## 'data.frame': 51 obs. of 5 variables: ## $ State and District of Columbia : chr "Alabama" "Alaska" "Arizona" "Arkansas" ... ## $ Obese adults : num 30.1 27.3 23.3 28.1 23.1 21 20.8 22.1 25.9 23.3 ... ## $ Overweight (incl. obese) adults: num 65.4 64.5 59.5 64.7 59.4 55 58.7 55 63.9 60.8 ... ## $ Obese children and adolescents : num 16.7 11.1 12.2 16.4 13.2 9.9 12.3 14.8 22.8 14.4 ... ## $ Obesity rank : int 3 14 40 9 41 51 49 43 22 39 ...
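For comparison, the same cleaning step sketched in Python on a hypothetical column is just a string strip followed by a numeric conversion:

```python
# Hypothetical column of percentage strings, as scraped from the table
obese_adults = ["30.1%", "27.3%", "23.3%", "28.1%"]

# Strip the trailing "%" and convert to numbers
obese_adults = [float(v.rstrip("%")) for v in obese_adults]
```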
Now we fix the attribute names to remove the spaces in them.
#Fixing the names to remove spaces names(obesity) ## [1] "State and District of Columbia" "Obese adults" ## [3] "Overweight (incl. obese) adults" "Obese children and adolescents" ## [5] "Obesity rank"
We will use make.names() to remove spaces and make syntactically valid names.
names(obesity) = make.names(names(obesity)) names(obesity) ## [1] "State.and.District.of.Columbia" "Obese.adults" ## [3] "Overweight..incl..obese..adults" "Obese.children.and.adolescents" ## [5] "Obesity.rank"
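As a rough illustration of what make.names() does, here is a Python approximation: every character that is not alphanumeric, a dot or an underscore becomes a dot (the real R function additionally handles leading digits and reserved words, which this sketch ignores):

```python
import re

names = ["State and District of Columbia", "Obese adults",
         "Overweight (incl. obese) adults"]

# Replace every character that is not [0-9A-Za-z._] with a dot
clean = [re.sub(r"[^0-9A-Za-z._]", ".", n) for n in names]
```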
The maps package in R provides methods to load geographical data for different countries and for the world into a data frame consisting of latitudes and longitudes, which can then be used to generate and visualize maps in R.
#Loading the map data-----------------
states = map_data("state") # ?map_data to read more on using this function
# create a new variable region for state
obesity$region = tolower(obesity$State.and.District.of.Columbia)
#merging the datasets
states = merge(states, obesity, by="region", all.x=T)
str(states)
## 'data.frame': 15537 obs. of 11 variables:
## $ region : chr "alabama" "alabama" "alabama" "alabama" ...
## $ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
## $ lat : num 30.4 30.4 30.4 30.3 30.3 ...
## $ group : num 1 1 1 1 1 1 1 1 1 1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ subregion : chr NA NA NA NA ...
## $ State.and.District.of.Columbia : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ Obese.adults : num 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 ...
## $ Overweight..incl..obese..adults: num 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 ...
## $ Obese.children.and.adolescents : num 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 ...
## $ Obesity.rank : int 3 3 3 3 3 3 3 3 3 3 ...
#a data frame for adding names to the states on the map
statenames = states %>%
  group_by(region) %>%
  summarise(
    long = mean(range(long)),
    lat = mean(range(lat)),
    group = mean(group),
    Obese.adults = mean(Obese.adults),
    Obese.children.and.adolescents = mean(Obese.children.and.adolescents)
  )
Now let's find the top 10 states with the most obese adults using the dplyr package.
#Data frame consisting of the top 10 states with most obese adults
topstate = states %>%
  group_by(region) %>%
  summarise(Obese.adults = mean(Obese.adults)) %>%
  arrange(desc(Obese.adults)) %>%
  top_n(10)
Making a Barplot.
#Plotting the top 10 states
ggplot(aes(x = reorder(region, Obese.adults), y = Obese.adults), data = topstate) +
  geom_col(color="black", fill="#1EDBC2", alpha=0.6) +
  labs(y = "Percentage of Obese Adults", x = "Top 10 States") +
  coord_flip()
From the barplot we notice that the state with the highest percentage of obese adults is Mississippi.
#Plotting the data on a map
#For obese adults
ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.adults)) +
  geom_polygon(color = "white", show.legend = T) +
  scale_fill_gradient(name = "Percent", low = "#FAB8D2", high = "#F91C74",
                      guide = "colorbar", na.value = "black",
                      breaks = pretty_breaks(n = 5)) +
  labs(title = "Obesity in Adults for USA", x = "Longitude", y = "Latitude") +
  coord_map() +
  #adding state names to the states on the map
  geom_text(data = statenames, aes(x = long, y = lat, label = region), size = 3)
The darker regions on the map indicate the states with the highest percentage of obese adults. The state with the lowest obese adult percentage is Connecticut.
Next, we create a new data frame with the top 15 states with the most obese children and teens.
#Now analyzing the obese children and teens
#Finding top 15 states with most obese children and teens
topChild = states %>%
  group_by(region) %>%
  summarise(Obese.Child.and.Teens = mean(Obese.children.and.adolescents)) %>%
  top_n(15)
Making a Barplot
#Barplot
ggplot(data = topChild, aes(x = reorder(region, Obese.Child.and.Teens), y = Obese.Child.and.Teens)) +
  geom_col(color="black", fill="#6EE543", alpha=0.8) +
  coord_flip()
As we can see, the state with the most obese children and teens is Delaware.
#Map for obesity in children
ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.children.and.adolescents)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Percent Obese", low = "#B8D5EC", high = "#0A4B7D",
                      guide = "colorbar", na.value = "black",
                      breaks = pretty_breaks(n = 5)) +
  labs(title = "Obesity in Children and Teens", x = "Longitude", y = "Latitude") +
  coord_map() +
  #adding state names to the states on the map
  geom_text(data = statenames, aes(x = long, y = lat, label = region), size = 3)
Now let's plot a complete barplot of the states against the percentage of obese children and teens.
ggplot(aes(x = reorder(region, Obese.children.and.adolescents), y = Obese.children.and.adolescents),
       data = statenames) +
  geom_col(color="black", fill="#F43E3E", width=1) +
  coord_flip() +
  labs(x = "States", y = "Percentage of Obese Children and Teens",
       title = "Barplot of Obese Children and Teens")
The state with the highest percentage of obese children and teens is Delaware, and the state with the lowest is Utah.
In this project we first learned to scrape data from Wikipedia using the rvest package, and then analyzed and visualized the states with the largest obese adult and child populations. We also learned how to create beautiful maps using the ggplot2 and maps packages in R.
I hope you liked the article and that it gets you started with scraping data from web documents, analyzing it yourself in R, and creating beautiful maps and plots. Make sure to like and share it. Cheers!
In this post, we will implement the closed-form solution to linear regression using numpy arrays and then compare our estimates to those obtained using the linear_model function from the statsmodels package.
First, let's import the modules and functions we'll need. We'll use numpy for matrix and linear algebra. In the last post, we obtained the Boston housing data set from R's MASS library. In Python, we can find the same data set in the scikit-learn module.
import numpy as np
import pandas as pd
from numpy.linalg import inv
from sklearn.datasets import load_boston
from statsmodels.regression.linear_model import OLS
Next, we can load the Boston data using the load_boston function. For those who aren't familiar with it, the Boston data set contains 13 economic, geographic, and demographic features for 506 tracts in the city of Boston from the 1970s, as well as the median home price in each of the tracts.
Now we’re ready to start. Recall from my previous post that linear regression typically takes the form:
$$y = X\beta + \epsilon$$
where ‘y’ is a vector of the response variable, ‘X’ is the matrix of our feature variables (sometimes called the ‘design’ matrix), and β is a vector of parameters that we want to estimate. \(\epsilon\) is the error term; it represents features that affect the response, but are not explicitly included in our model.
From the boston object, we will extract the features, which are conveniently already a numpy array, and assign them to X. Our target variable is the median home value (in thousands of US dollars) for each tract; we assign it to the variable y.
# load the boston data set
boston = load_boston()

# obtain the feature matrix as a numpy array
X = boston.data

# obtain the target variable as a numpy array
y = boston.target
If we examine the features, we can see that X is the expected shape.
print(X.shape)
(506, 13)
We can also look at the feature names.
feature_names = boston.feature_names
print(feature_names)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
If you would like to see more details about these features and the data set, you can view the DESCR attribute of the boston object.
print(boston.DESCR)
We will need to add a vector of ones to our feature matrix for the intercept term. We can do this easily in numpy and then add it as the first column in our feature matrix.
# create vector of ones... (note: the name "int" shadows Python's built-in int)
int = np.ones(shape=y.shape)[..., None]

#...and add to feature matrix
X = np.concatenate((int, X), 1)
With the preparatory work out of the way, we can now implement the closed-form solution to obtain OLS parameter estimates.
$$\beta = (X^TX)^{-1}X^Ty$$
We do this in Python using the numpy arrays we just created, the inv() function, and the transpose() and dot() methods.
# calculate coefficients using closed-form solution
coeffs = inv(X.transpose().dot(X)).dot(X.transpose()).dot(y)
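As an aside, explicitly inverting \(X^TX\) can be numerically fragile when features are nearly collinear; numpy's lstsq solves the same least-squares problem through a more stable factorization. A small sketch on synthetic data (the data below are made up for illustration, not the Boston set):

```python
import numpy as np
from numpy.linalg import inv, lstsq

rng = np.random.default_rng(0)

# synthetic data: 100 observations, an intercept column plus 3 features
X_sim = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
beta_true = np.array([1.0, 2.0, -3.0, 0.5])
y_sim = X_sim @ beta_true + rng.normal(scale=0.1, size=100)

# closed-form solution via explicit inversion
beta_inv = inv(X_sim.T @ X_sim) @ X_sim.T @ y_sim

# the same least-squares problem solved with lstsq
beta_lstsq, *_ = lstsq(X_sim, y_sim, rcond=None)

print(np.allclose(beta_inv, beta_lstsq))  # True
```

On well-conditioned data like this the two agree to machine precision; the difference only matters as \(X^TX\) approaches singularity.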
Let's examine them to see if they make sense. Here I convert the coeffs array to a pandas DataFrame and add the feature names as an index.
# extract the feature names of the boston data set and prepend the intercept
feature_names = np.insert(boston.feature_names, 0, 'INT')

# collect results into a DataFrame for pretty printing
results = pd.DataFrame({'coeffs': coeffs}, index=feature_names)
print(results.round(2))
         coeffs
INT       36.49
CRIM      -0.11
ZN         0.05
INDUS      0.02
CHAS       2.69
NOX      -17.80
RM         3.80
AGE        0.00
DIS       -1.48
RAD        0.31
TAX       -0.01
PTRATIO   -0.95
B          0.01
LSTAT     -0.53
Now we have our parameter vector β. These are the linear relationships between the median home value and each of the features. For example, we see that an increase of one unit in the number of rooms (RM) is associated with a $3,810 increase in home value. We can find the relationship between the response and any of the feature variables in the same manner, paying careful attention to the units in the data description. Notice that one of our features, ‘CHAS’, is a dummy variable which takes a value of 0 or 1 depending on whether or not the tract is adjacent to the Charles River. The coefficient of ‘CHAS’ tells us that homes in tracts adjacent to the Charles River (coded as 1) have a median price that is $2,690 higher than homes in tracts that do not border the river (coded as 0) when the other variables are held constant.
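To see the "one-unit increase" interpretation concretely: in a linear model, raising a single feature by one unit shifts the prediction by exactly that feature's coefficient. A sketch with hypothetical numbers (a toy two-feature model, not the full Boston fit):

```python
import numpy as np

# hypothetical coefficients: intercept, then two features
beta = np.array([36.49, 3.80, -0.95])

# one observation, with the leading 1 for the intercept
x = np.array([1.0, 6.0, 18.0])

# the same observation with the first feature raised by one unit
x_plus = x.copy()
x_plus[1] += 1.0

# the prediction shifts by exactly beta[1] = 3.8 (up to floating-point error)
print(x_plus @ beta - x @ beta)
```

This also makes the CHAS interpretation clear: flipping a 0/1 dummy variable moves the prediction by exactly its coefficient, holding everything else fixed.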
Now we'll estimate the same model using the linear_model function from statsmodels and assign the results to coeffs_lm.
# create a linear model and extract the parameters
coeffs_lm = OLS(y, X).fit().params
Last of all, we place our newly estimated parameters next to our original ones in the results DataFrame and compare.
results['coeffs_lm'] = coeffs_lm
print(results.round(2))
         coeffs  coeffs_lm
INT       36.49      36.49
CRIM      -0.11      -0.11
ZN         0.05       0.05
INDUS      0.02       0.02
CHAS       2.69       2.69
NOX      -17.80     -17.80
RM         3.80       3.80
AGE        0.00       0.00
DIS       -1.48      -1.48
RAD        0.31       0.31
TAX       -0.01      -0.01
PTRATIO   -0.95      -0.95
B          0.01       0.01
LSTAT     -0.53      -0.53
Once again, our results are identical. Now you know how these estimates are obtained using the closed-form solution. As I mentioned in my R post on the same topic, you'd never actually implement linear regression this way in practice. You would use the linear_model function, or the LinearRegression function from the scikit-learn package if you'd prefer to approach linear regression from a machine learning standpoint. These functions are very quick, require very little code, and provide a number of diagnostic statistics, including \(R^2\), t-statistics, and p-values.
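For the scikit-learn route mentioned above, a minimal sketch looks like the following. I use synthetic data here because recent scikit-learn releases have removed the Boston loader, so the data and variable names below are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# synthetic regression data: 200 observations, 3 features
X_sim = rng.normal(size=(200, 3))
y_sim = 5.0 + X_sim @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=200)

# LinearRegression adds the intercept itself, so no column of ones is needed
model = LinearRegression().fit(X_sim, y_sim)

print(model.intercept_)  # close to 5.0
print(model.coef_)       # close to [1.5, -2.0, 0.7]
```

Note the design difference from the closed-form approach: scikit-learn handles the intercept via the fit_intercept argument rather than an explicit ones column in the design matrix.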
I have made the code from this post available at my Github here.