Related Post

- Analytical and Numerical Solutions to Linear Regression Problems
- How to create a loop to run multiple regression models
- Regression model with auto correlated errors – Part 3, some astrology
- Regression model with auto correlated errors – Part 2, the models
- Regression model with auto correlated errors – Part 1, the data

In the first part of this exercise, we will build a logistic regression model to predict whether a student gets admitted into a university. Suppose that you are the administrator of a university department and you want to determine each applicant’s chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant’s scores on two exams and the admissions decision. Our task is to build a classification model that estimates an applicant’s probability of admission based on the scores from those two exams.

library(ggplot2)
library(data.table)
library(magrittr)

The first two columns contain the exam scores, and the third column contains the label.

ex2data1 <- fread("ex2data1.txt", col.names = c("Exam1", "Exam2", "Label"))
head(ex2data1)

#    Exam1    Exam2 Label
# 34.62366 78.02469     0
# 30.28671 43.89500     0
# 35.84741 72.90220     0
# 60.18260 86.30855     1
# 79.03274 75.34438     1
# 45.08328 56.31637     0

Before starting to implement any learning algorithm, it is always good to visualize the data if possible.

# The colour vector definition was truncated in the original post; any two colours work
cols <- c("darkred", "darkblue")
ex2data1 %>%
  ggplot(aes(x = Exam1, y = Exam2, color = factor(Label))) +
  geom_point(size = 4, shape = 19, alpha = 0.6) +
  scale_colour_manual(values = cols, labels = c("Not admitted", "Admitted"),
                      name = "Admission Status")

The hypothesis uses the sigmoid (logistic) function, given by: g(z) = 1 / (1 + e^{−z}).

Let’s code the sigmoid function so that we can call it in the rest of our programs.

my_sigmoid <- function(z){
  1 / (1 + exp(-z))
}

For large positive values of z, the sigmoid should be close to 1, while for large negative values it should be close to 0. Evaluating my_sigmoid(0) should give exactly 0.5.

Let’s check!

my_sigmoid(0)       # 0.5
my_sigmoid(10^10)   # 1
my_sigmoid(-10^10)  # 0

We can visualize the sigmoid function graphically:

x <- seq(-10, 10, by = 0.1)
y <- my_sigmoid(x)
plot(x, y, type = "l", col = "red", xlab = "", ylab = "",
     main = "Sigmoid function (Logistic curve)")
abline(v = 0, lty = 2, col = "gray70")
abline(h = 0.5, lty = 2, col = "gray70")

The cost function for logistic regression is: J(θ) = (1/m) Σ [ −y log(h_θ(x)) − (1 − y) log(1 − h_θ(x)) ], where h_θ(x) = g(θ^{T}x).

Add ones for the intercept term:

y <- ex2data1$Label
X <- cbind(1, ex2data1$Exam1, ex2data1$Exam2)
head(X)

# 1 34.62366 78.02469
# 1 30.28671 43.89500
# 1 35.84741 72.90220
# 1 60.18260 86.30855
# 1 79.03274 75.34438
# 1 45.08328 56.31637

Initialize fitting parameters:

initial_theta <- matrix(rep(0, ncol(X)))

The function below calculates the cost.

my_cost <- function(theta, X, y){
  h_theta <- my_sigmoid(X %*% theta)
  cost <- 1/nrow(X) * sum(-y * log(h_theta) - (1 - y) * log(1 - h_theta))
  return(cost)
}

What is the cost for the initial theta parameters, which are all zeros?

round(my_cost(initial_theta, X, y), 3)
# 0.693

The function below calculates the gradient.

my_gradient <- function(theta, X, y){
  ## Option 1 -- looping
  # gradient <- rep(0, ncol(X))
  # for(j in 1:ncol(X)){
  #   for(i in 1:nrow(X)){
  #     gradient[j] <- gradient[j] + 1/nrow(X) * (my_sigmoid(X[i, ] %*% theta) - y[i]) * X[i, j]
  #   }
  # }
  ## Option 2 -- vectorized, more succinct
  gradient <- 1/nrow(X) * (my_sigmoid(t(theta) %*% t(X)) - t(y)) %*% X
  return(gradient)
}

The gradient at the initial theta parameters (all zeros) is shown below.

round(my_gradient(initial_theta, X, y), 3)
# -0.1 -12.009 -11.263

We could use gradient descent to get the optimal theta values, but using an optimization library converges more quickly. So, let’s use R’s general-purpose optimization function optim to get the required theta values and the associated cost.

optimized <- optim(par = initial_theta, X = X, y = y, fn = my_cost, gr = my_gradient)
names(optimized)
# 'par' 'value' 'counts' 'convergence' 'message'

cat("The optimized theta values are: ")
round(optimized$par, 3)
# The optimized theta values are: -25.165 0.206 0.202

cat('The cost at the final theta values is: ')
round(optimized$value, 3)
# The cost at the final theta values is: 0.203

Now, let’s plot the decision boundary. Only two points are required to define a line, so let’s choose two endpoints.

plot_x <- c(min(ex2data1$Exam1) - 2, max(ex2data1$Exam1) + 2)
# Calculate the decision boundary line
plot_y <- (-1/optimized$par[3]) * (optimized$par[2] * plot_x + optimized$par[1])
## Calculate slope and intercept
slope <- (plot_y[2] - plot_y[1]) / (plot_x[2] - plot_x[1])
intercept <- plot_y[2] - slope * plot_x[2]
cols <- c("darkred", "darkblue")  # colour vector truncated in the original post
ex2data1 %>%
  ggplot(aes(x = Exam1, y = Exam2, color = factor(Label))) +
  geom_point(size = 4, shape = 19, alpha = 0.6) +
  scale_colour_manual(values = cols, labels = c("Not admitted", "Admitted"),
                      name = "Admission Status") +
  geom_abline(intercept = intercept, slope = slope, col = "purple", show.legend = TRUE)

After learning the parameters, you can use the model to predict whether a particular student will be admitted. For example, for a student with an Exam 1 score of 45 and an Exam 2 score of 85, the probability of admission is shown below.

round(my_sigmoid(c(1, 45, 85) %*% optimized$par), 3)
# 0.776

Now, let’s calculate the model accuracy, using a threshold of 0.5.

ex2data1$fitted_result <- my_sigmoid(X %*% optimized$par)
ex2data1$fitted_result_label <- ifelse(ex2data1$fitted_result >= 0.5, 1, 0)
head(ex2data1)

#    Exam1    Exam2 Label fitted_result fitted_result_label
# 34.62366 78.02469     0  9.102720e-02                   0
# 30.28671 43.89500     0  4.220398e-05                   0
# 35.84741 72.90220     0  4.389943e-02                   0
# 60.18260 86.30855     1  9.904316e-01                   1
# 79.03274 75.34438     1  9.982001e-01                   1
# 45.08328 56.31637     0  1.079133e-02                   0

accuracy <- sum(ex2data1$Label == ex2data1$fitted_result_label) / nrow(ex2data1)
accuracy
# 0.89

In this part of the exercise, we will implement regularized logistic regression to predict whether microchips from a fabrication plant pass quality assurance (QA). During QA, each microchip goes through various tests to ensure it is functioning correctly. Suppose you are the product manager of the factory and you have the test results for some microchips on two different tests. From these two tests, you would like to determine whether the microchips should be accepted or rejected.

ex2data2 <- fread("ex2data2.txt", col.names = c("Test1", "Test2", "Label"))
head(ex2data2)

#     Test1   Test2 Label
#  0.051267 0.69956     1
# -0.092742 0.68494     1
# -0.213710 0.69225     1
# -0.375000 0.50219     1
# -0.513250 0.46564     1
# -0.524770 0.20980     1

cols <- c("darkred", "darkblue")  # colour vector truncated in the original post
ex2data2 %>%
  ggplot(aes(x = Test1, y = Test2, color = factor(Label))) +
  geom_point(size = 4, shape = 19, alpha = 0.6) +
  scale_colour_manual(values = cols, labels = c("Failed", "Passed"), name = "Test Result")

The above figure shows that our dataset cannot be separated into positive and negative examples by a straight line through the plot. Therefore, a straightforward application of logistic regression will not perform well on this dataset, since logistic regression is only able to find a linear decision boundary.

One way to fit the data better is to create more features from each data point. Let’s map the features into all polynomial terms of x1 and x2 up to the sixth power.

new_data <- c()
for(i in 1:6){
  for(j in 0:i){
    temp <- (ex2data2$Test1)^i + (ex2data2$Test2)^(i - j)
    new_data <- cbind(new_data, temp)
  }
}
colnames(new_data) <- paste0("V", 1:ncol(new_data))
head(new_data)

# (first 6 rows of the 27 new feature columns V1..V27; wide output omitted)

y <- ex2data2$Label
X <- new_data

As a result of this mapping, our vector of two features (the scores on two QA tests) has been transformed into a 28-dimensional vector (the code above produces 27 polynomial columns; with an intercept term the vector is 28-dimensional). A logistic regression classifier trained on this higher-dimensional feature vector will have a more complex decision boundary and will appear nonlinear when drawn in our 2-dimensional plot. While the feature mapping allows us to build a more expressive classifier, it is also more susceptible to overfitting. In the next parts of the exercise, we will implement regularized logistic regression to fit the data and also see for ourselves how regularization can help combat the overfitting problem.
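For reference, the feature mapping in the original Coursera exercise multiplies powers of the two features (x1^(i−j) · x2^j) rather than summing them, and includes an intercept column, which is what gives the 28-dimensional vector mentioned above. A minimal sketch (the name mapFeature follows the course’s Octave code; this is not the code this post ran):

```r
# Sketch of the Coursera-style polynomial feature map: all product terms
# x1^(i-j) * x2^j for i = 1..degree, j = 0..i, plus an intercept column.
mapFeature <- function(x1, x2, degree = 6){
  out <- matrix(1, nrow = length(x1), ncol = 1)  # intercept column of ones
  for(i in 1:degree){
    for(j in 0:i){
      out <- cbind(out, (x1^(i - j)) * (x2^j))
    }
  }
  out
}

# degree 6 gives 1 + 2 + 3 + ... + 7 = 28 columns
feats <- mapFeature(c(0.5, -0.2), c(0.7, 0.3))
ncol(feats)  # 28
```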

my_cost2 <- function(theta, X, y, lambda){
  # compute the regularized cost for logistic regression
  h_theta <- my_sigmoid(X %*% theta)
  cost <- 1/nrow(X) * sum(-y * log(h_theta) - (1 - y) * log(1 - h_theta)) +
    lambda/(2 * nrow(X)) * sum(theta[-1]^2)  # the first parameter is not regularized
  return(cost)
}

my_gradient2 <- function(theta, X, y, lambda){
  ## Option 1 -- looping
  gradient <- rep(0, ncol(X))
  for(i in 1:nrow(X)){
    gradient[1] <- gradient[1] + 1/nrow(X) * (my_sigmoid(X[i, ] %*% theta) - y[i]) * X[i, 1]
  }
  for(j in 2:ncol(X)){
    for(i in 1:nrow(X)){
      gradient[j] <- gradient[j] + 1/nrow(X) * (my_sigmoid(X[i, ] %*% theta) - y[i]) * X[i, j]
    }
    gradient[j] <- gradient[j] + lambda/nrow(X) * theta[j]  # add the penalty once per parameter
  }
  ## Option 2 -- vectorized, more succinct
  # gradient <- 1/nrow(X) * (my_sigmoid(t(theta) %*% t(X)) - t(y)) %*% X +
  #   (lambda/nrow(X)) * c(0, theta[-1])
  return(gradient)
}

Let’s initialize theta:

initial_theta <- matrix(rep(0, ncol(X)))

What is the cost for the initial theta?

round(my_cost2(initial_theta, X, y, lambda = 10), 3)
# 0.693

Now, since we have the cost function that we want to minimize and its gradient, we can use the optimization function optim to find the optimal theta values.

We would have to try various values of lambda and select the best one by cross-validation. But for now, let’s just take lambda = 1.
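As an illustration of what such a lambda search could look like, here is a self-contained sketch on synthetic data (not this post’s code; a proper version would score each lambda on held-out folds rather than on the training set):

```r
# Hypothetical lambda search for regularized logistic regression (sketch).
sigmoid <- function(z) 1 / (1 + exp(-z))

cost_reg <- function(theta, X, y, lambda){
  h <- sigmoid(X %*% theta)
  h <- pmin(pmax(h, 1e-12), 1 - 1e-12)  # guard against log(0)
  1/nrow(X) * sum(-y * log(h) - (1 - y) * log(1 - h)) +
    lambda/(2 * nrow(X)) * sum(theta[-1]^2)
}

grad_reg <- function(theta, X, y, lambda){
  h <- sigmoid(X %*% theta)
  as.vector(1/nrow(X) * t(X) %*% (h - y) + lambda/nrow(X) * c(0, theta[-1]))
}

set.seed(1)
X <- cbind(1, matrix(rnorm(200), ncol = 2))  # intercept + 2 synthetic features
y <- ifelse(X[, 2] + X[, 3] + rnorm(100, sd = 0.5) > 0, 1, 0)

lambdas <- c(0, 0.1, 1, 10, 100)
acc <- sapply(lambdas, function(l){
  fit <- optim(par = rep(0, ncol(X)), fn = cost_reg, gr = grad_reg,
               X = X, y = y, lambda = l, method = "BFGS")
  mean(ifelse(sigmoid(X %*% fit$par) >= 0.5, 1, 0) == y)
})
lambdas[which.max(acc)]  # lambda with the best (here: training) accuracy
```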

optimized <- optim(par = initial_theta, X = X, y = y, lambda = 1,
                   fn = my_cost2, gr = my_gradient2, method = "BFGS")

The theta values from the optimization are shown below.

matrix(as.vector(optimized$par), ncol = 9)

#  0.49507083 -0.05202124 -0.08527443 -0.1575100 -0.035020544 -0.11079198  0.01169745 -0.12174228 -0.12234383
#  0.05639364 -0.02062364  0.01959276 -0.1129876 -0.003622936 -0.06626957  0.04309506 -0.13996607 -0.01747664
# -0.15688844 -0.05837426  0.05099037 -0.1398877 -0.092568188 -0.09316975 -0.14349836 -0.09544366  0.01392097

ex2data2$fitted_result <- my_sigmoid(X %*% optimized$par)

Let’s use a threshold of 0.5.

ex2data2$fitted_result_label <- ifelse(ex2data2$fitted_result >= 0.5, 1, 0)
head(ex2data2, 10)

#      Test1     Test2 Label fitted_result fitted_result_label
#   0.051267  0.699560     1     0.4765027                   0
#  -0.092742  0.684940     1     0.4629791                   0
#  -0.213710  0.692250     1     0.4410811                   0
#  -0.375000  0.502190     1     0.4701446                   0
#  -0.513250  0.465640     1     0.4459059                   0
#  -0.524770  0.209800     1     0.4554844                   0
#  -0.398040  0.034357     1     0.4730207                   0
#  -0.305880 -0.192250     1     0.4618593                   0
#   0.016705 -0.404240     1     0.4735035                   0
#   0.131910 -0.513890     1     0.4640032                   0

Now, we can evaluate the fit by calculating various metrics such as F1 score, precision, and recall. Let’s just look at accuracy here.

accuracy <- sum(ex2data2$Label == ex2data2$fitted_result_label) / nrow(ex2data2)
cat("The accuracy is: ")
round(accuracy, 3)
# The accuracy is: 0.653
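For completeness, here is a small self-contained sketch of how precision, recall, and F1 could be computed from 0/1 actual and predicted labels in base R (the vectors are illustrative, not this post’s data):

```r
# Sketch: precision, recall and F1 from actual vs. predicted 0/1 labels.
actual    <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 1, 0, 0, 1, 1, 0, 1, 0)

tp <- sum(actual == 1 & predicted == 1)  # true positives
fp <- sum(actual == 0 & predicted == 1)  # false positives
fn <- sum(actual == 1 & predicted == 0)  # false negatives

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
c(precision = precision, recall = recall, f1 = f1)  # all 0.8 for these vectors
```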

Logistic regression is a classification machine learning technique. In this blog post, we saw how to implement logistic regression with and without regularization. In the first part, we used it to predict if a student will be admitted to a university, and in the second part, we used it to predict whether microchips from a fabrication plant pass quality assurance. We used a general-purpose optimization function in lieu of gradient descent to get the optimal values of the coefficients. The data sets are from the Coursera machine learning course offered by Andrew Ng. The course is offered with Matlab/Octave; I am doing the exercises in that course with R. You can get the code from this Github repository.


In this part of the exercise, we will implement linear regression with one variable to predict profits for a food truck. Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities, and you have data for profits and populations from those cities. We would like to use this data to help us select which city to expand to next.

The file ex1data1.txt contains the dataset for our linear regression problem. The first column is the population of a city and the second column is the profit of a food truck in that city. A negative value for profit indicates a loss. The first column refers to the population size in 10,000s and the second column refers to the profit in $10,000s.

library(ggplot2)
library(data.table)
library(magrittr)
library(caret)
library(fields)
library(plot3D)

**Load the data and display first 6 observations**

ex1data1 <- fread("ex1data1.txt", col.names = c("population", "profit"))
head(ex1data1)

#    population  profit
# 1:     6.1101 17.5920
# 2:     5.5277  9.1302
# 3:     8.5186 13.6620
# 4:     7.0032 11.8540
# 5:     5.8598  6.8233
# 6:     8.3829 11.8860

Before starting on any task, it is often useful to understand the data by visualizing it. For this dataset, we can use a scatter plot to visualize the data, since it has only two properties to plot (profit and population). Many other problems that we encounter in real life are multi-dimensional and can’t be plotted on a 2-d plot.

ex1data1 %>%
  ggplot(aes(x = population, y = profit)) +
  geom_point(color = "blue", size = 4, alpha = 0.5) +
  ylab('Profit in $10,000s') +
  xlab('Population of City in 10,000s') +
  ggtitle('Figure 1: Scatter plot of training data') +
  theme(plot.title = element_text(size = 16, colour = "red"))

In this part, we will fit linear regression parameters θ to our dataset using gradient descent.

To implement gradient descent, we need to calculate the cost, which is given by: J(θ) = (1/2m) Σ (h_θ(x^{(i)}) − y^{(i)})², where the hypothesis is h_θ(x) = θ^{T}x.

As we perform gradient descent to minimize the cost function J(θ), it is helpful to monitor convergence by computing the cost. In this section, we will implement a function to calculate J(θ) so we can check the convergence of our gradient descent implementation.

But remember, before we go any further, we need to add the intercept term (a column of ones) to X.

X <- cbind(1, ex1data1$population)
y <- ex1data1$profit
head(X)

#      [,1]   [,2]
# [1,]    1 6.1101
# [2,]    1 5.5277
# [3,]    1 8.5186
# [4,]    1 7.0032
# [5,]    1 5.8598
# [6,]    1 8.3829

The function below calculates the cost based on the equation given above.

computeCost <- function(X, y, theta){
  z <- ((X %*% theta) - y)^2
  return(sum(z) / (2 * nrow(X)))
}

Now, we can calculate the initial cost by initializing the parameters to 0.

theta <- matrix(rep(0, ncol(X)))
round(computeCost(X, y, theta), 2)
# 32.07

Next, we will implement gradient descent by calling the *computeCost* function above. A good way to verify that gradient descent is working correctly is to look at the value of J(θ) and check that it is decreasing with each step. Our value of J(θ) should never increase, and should converge to a steady value by the end of the algorithm.

The gradient descent function below returns the cost in every iteration and the optimal parameters for the number of iterations and learning rate alpha specified.

gradientDescent <- function(X, y, theta, alpha, iters){
  gd <- list()
  cost <- rep(0, iters)
  for(k in 1:iters){
    z <- rep(0, ncol(X))
    for(i in 1:ncol(X)){
      for(j in 1:nrow(X)){
        z[i] <- z[i] + ((X[j, ] %*% theta) - y[j]) * X[j, i]
      }
    }
    theta <- theta - (alpha/nrow(X)) * z
    cost[k] <- computeCost(X, y, theta)
  }
  gd$theta <- theta
  gd$cost <- cost
  gd
}

Now, let’s use the *gradientDescent* function to find the parameters, and we have to make sure that our cost never increases. Let’s use 1500 iterations and a learning rate alpha of 0.01 for now. Later, we will see the effect of these values in our application.

iterations <- 1500
alpha <- 0.01
theta <- matrix(rep(0, ncol(X)))
gradientDescent_results <- gradientDescent(X, y, theta, alpha, iterations)
theta <- gradientDescent_results$theta
theta

#           [,1]
# [1,] -3.630291
# [2,]  1.166362

We expect that the cost should be decreasing, or at least never increasing, if our implementation is correct. Let’s plot the cost as a function of the number of iterations and make sure our implementation makes sense.

data.frame(Cost = gradientDescent_results$cost, Iterations = 1:iterations) %>%
  ggplot(aes(x = Iterations, y = Cost)) + geom_line(color = "blue") +
  ggtitle("Cost as a function of number of iterations") +
  theme(plot.title = element_text(size = 16, colour = "red"))
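Beyond eyeballing the plot, the non-increase of the cost can also be checked programmatically. A self-contained sketch on tiny synthetic data, using the same cost function and a vectorized form of the same update rule (names and data here are illustrative):

```r
# Sketch: assert that gradient-descent costs never increase.
computeCost <- function(X, y, theta) sum(((X %*% theta) - y)^2) / (2 * nrow(X))

set.seed(42)
X <- cbind(1, runif(20, 5, 20))      # intercept + one feature
y <- -3 + 1.2 * X[, 2] + rnorm(20)   # noisy linear target

theta <- matrix(0, ncol(X))
alpha <- 0.005
costs <- numeric(200)
for(k in 1:200){
  grad  <- t(X) %*% (X %*% theta - y) / nrow(X)  # vectorized gradient
  theta <- theta - alpha * grad
  costs[k] <- computeCost(X, y, theta)
}

all(diff(costs) <= 0)  # TRUE when alpha is small enough
```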

Here is the plot:

As the plot above shows, the cost decreases with the number of iterations and gets very close to convergence within 1500 iterations.

Now, since we know the parameters (slope and intercept), we can plot the linear fit on the scatter plot.

ex1data1 %>%
  ggplot(aes(x = population, y = profit)) +
  geom_point(color = "blue", size = 3, alpha = 0.5) +
  ylab('Profit in $10,000s') +
  xlab('Population of City in 10,000s') +
  ggtitle('Figure 1: Scatter plot of training data') +
  geom_abline(intercept = theta[1], slope = theta[2], col = "red", show.legend = TRUE) +
  theme(plot.title = element_text(size = 16, colour = "red")) +
  annotate("text", x = 12, y = 20,
           label = paste0("Profit = ", round(theta[1], 4), " + ", round(theta[2], 4), " * Population"))

To understand the cost function J(θ) better, let’s now plot the cost over a 2-dimensional grid of θ0 and θ1 values. The global minimum is the optimal point for θ0 and θ1, and each step of gradient descent moves closer to this point.

Intercept <- seq(from = -10, to = 10, length = 100)
Slope <- seq(from = -1, to = 4, length = 100)
# initialize cost values to a matrix of 0's
Cost <- matrix(0, length(Intercept), length(Slope))
for(i in 1:length(Intercept)){
  for(j in 1:length(Slope)){
    t <- c(Intercept[i], Slope[j])
    Cost[i, j] <- computeCost(X, y, t)
  }
}
persp3D(Intercept, Slope, Cost, theta = -45, phi = 25, expand = 0.75,
        lighting = TRUE, ticktype = "detailed",
        xlab = "Intercept", ylab = "Slope", zlab = "", axes = TRUE, main = "Surface")
image.plot(Intercept, Slope, Cost, main = "Contour, showing minimum")
contour(Intercept, Slope, Cost, add = TRUE, n = 30, labels = '')
points(theta[1], theta[2], col = 'red', pch = 4, lwd = 6)

Since linear regression has a closed-form solution, we can also solve it analytically using what is called the normal equation, given by the formula below. With the normal equation, we do not need to iterate or choose a learning rate. However, we need to compute the inverse of a matrix, which makes it slow when the number of records is very large. Gradient descent, on the other hand, is applicable to other machine learning techniques as well, and it is the more appropriate method even for linear regression when the number of observations is very large.

θ=(X^{T}X)^{−1}X^{T}y

theta2 <- solve(t(X) %*% X) %*% t(X) %*% y
theta2

#           [,1]
# [1,] -3.895781
# [2,]  1.193034

There is a very small difference between the parameters we got from the normal equation and those from gradient descent. Let’s increase the number of iterations and see if they get closer to each other. I increased the number of iterations from 1500 to 15000.

iterations <- 15000
alpha <- 0.01
theta <- matrix(rep(0, ncol(X)))
gradientDescent_results <- gradientDescent(X, y, theta, alpha, iterations)
theta <- gradientDescent_results$theta
theta

#           [,1]
# [1,] -3.895781
# [2,]  1.193034

As you can see, now the results from normal equation and gradient descent are the same.

By the way, we can also use packages developed by experts in the field to perform our machine learning tasks. There are many machine learning packages in R for different types of tasks. To verify that we get the same results, let’s use the caret package, which is among the most commonly used machine learning packages in R.

my_lm <- train(profit ~ population, data = ex1data1, method = "lm")
my_lm$finalModel$coefficients

#       (Intercept)      population
# -3.89578087831187 1.1930336441896

In this part of the exercise, we will implement linear regression with multiple variables to predict the prices of houses. Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recently sold houses and make a model of housing prices. The file ex1data2.txt contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.

ex1data2 <- fread("ex1data2.txt", col.names = c("size", "bedrooms", "price"))
head(ex1data2)

#    size bedrooms  price
# 1: 2104        3 399900
# 2: 1600        3 329900
# 3: 2400        3 369000
# 4: 1416        2 232000
# 5: 3000        4 539900
# 6: 1985        4 299900

House sizes are about 1000 times the number of bedrooms. When features differ by orders of magnitude, first performing feature scaling can make gradient descent converge much more quickly.

ex1data2 <- as.data.frame(ex1data2)
for(i in 1:(ncol(ex1data2) - 1)){
  ex1data2[, i] <- (ex1data2[, i] - mean(ex1data2[, i])) / sd(ex1data2[, i])
}
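Incidentally, base R’s scale() performs the same centering and standardization in one call; a quick sketch on illustrative numbers:

```r
# Sketch: the manual loop above is equivalent to base R's scale().
df <- data.frame(size = c(2104, 1600, 2400, 1416), bedrooms = c(3, 3, 2, 2))

manual <- df
for(i in 1:ncol(manual)){
  manual[, i] <- (manual[, i] - mean(manual[, i])) / sd(manual[, i])
}

auto <- as.data.frame(scale(df))
max(abs(as.matrix(manual) - as.matrix(auto)))  # essentially zero
```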

Previously, we implemented gradient descent on a univariate regression problem. The only difference now is that there is one more feature in the matrix X. The hypothesis function and the batch gradient descent update rule remain unchanged.

X <- cbind(1, ex1data2$size, ex1data2$bedrooms)
y <- ex1data2$price
head(X)

#      [,1]        [,2]       [,3]
# [1,]    1  0.13000987 -0.2236752
# [2,]    1 -0.50418984 -0.2236752
# [3,]    1  0.50247636 -0.2236752
# [4,]    1 -0.73572306 -1.5377669
# [5,]    1  1.25747602  1.0904165
# [6,]    1 -0.01973173  1.0904165

iterations <- 6000
alpha <- 0.01
theta <- matrix(rep(0, ncol(X)))
gradientDescent_results <- gradientDescent(X, y, theta, alpha, iterations)
theta <- gradientDescent_results$theta
theta

#            [,1]
# [1,] 340412.660
# [2,] 110631.050
# [3,]  -6649.474

theta2 <- solve(t(X) %*% X) %*% t(X) %*% y
theta2

#            [,1]
# [1,] 340412.660
# [2,] 110631.050
# [3,]  -6649.474

ex1data2 <- fread("ex1data2.txt", col.names = c("size", "bedrooms", "price"))
my_lm <- train(price ~ size + bedrooms, data = ex1data2, method = "lm",
               preProcess = c("center", "scale"))
my_lm$finalModel$coefficients

#      (Intercept)             size         bedrooms
# 340412.659574468 110631.050278846 -6649.4742708198

In this post, we saw how to implement numerical and analytical solutions to linear regression problems using R. We also used `caret`, the well-known R machine learning package, to verify our results. The data sets are from the Coursera machine learning course offered by Andrew Ng. The course is offered with Matlab/Octave; I am doing the exercises in that course with R. You can get the code from this Github repository.


This week’s short blog post is on visualizing streaming data using Shiny. The dashboard updates automatically to incorporate newly added data. At the end of this post, you will find a video tutorial on how to send text and email alerts, based on streaming sensor data, when there is an abnormal reading or when a sensor fails. To simulate streaming data, the following script repeatedly writes a random sample of the diamonds dataset to a new CSV file every ten seconds.

library(ggplot2)
library(dplyr)
data("diamonds")
while(TRUE){
  temp <- sample_frac(diamonds, 0.1)
  write.csv(temp, paste0("sampled", gsub("[^0-9]", "", Sys.time()), ".csv"),
            row.names = FALSE)
  Sys.sleep(10)  # suspend execution of R expressions for 10 seconds between writes
}

Then, the following Shiny code fetches the newly generated data and creates the dashboard.

**ui.R**

library(shiny)
fluidPage(
  tags$h2("Visualizing Streaming Data with Shiny",
          style = "color:blue;text-align:center"),
  plotOutput("plot1", height = "600px")
)

**server.R**

library(shiny)
library(data.table)
library(ggplot2)
library(gridExtra)
library(readr)

IsThereNewFile <- function(){
  # cheap function whose values over time will be tested for equality;
  # inequality indicates that the underlying value has changed and needs to be
  # invalidated and re-read using valueFunc
  filenames <- list.files(pattern = "*.csv", full.names = TRUE)
  length(filenames)
}

ReadAllData <- function(){
  # a function that calculates the underlying value
  filenames <- list.files(pattern = "*.csv", full.names = TRUE)
  read_csv(filenames[length(filenames)])
}

function(input, output, session){
  # 10: number of milliseconds to wait between calls to checkFunc
  sampled_data <- reactivePoll(10, session, IsThereNewFile, ReadAllData)
  output$plot1 <- renderPlot({
    sampled_data <- sampled_data()
    g1 <- ggplot(sampled_data, aes(depth, fill = cut, colour = cut)) +
      geom_density(alpha = 0.1) + xlim(55, 70) +
      ggtitle("Distribution of Depth by Cut") +
      theme(plot.title = element_text(color = "darkred", size = 18, hjust = 0.5),
            axis.text.y = element_text(color = "blue", size = 12, hjust = 1),
            axis.text.x = element_text(color = "darkred", size = 12, hjust = .5, vjust = .5),
            axis.title.x = element_text(color = "red", size = 14),
            axis.title.y = element_text(size = 14))
    g2 <- ggplot(sampled_data, aes(carat, ..count.., fill = cut)) +
      geom_density(position = "stack") + ggtitle("Total Carat by Count") +
      theme(plot.title = element_text(color = "purple", size = 18, hjust = 0.5),
            axis.text.y = element_text(color = "blue", size = 12, hjust = 1),
            axis.text.x = element_text(color = "darkred", size = 12, hjust = .5, vjust = .5),
            axis.title.x = element_text(color = "red", size = 14),
            axis.title.y = element_text(size = 14))
    g3 <- ggplot(sampled_data, aes(carat, ..count.., fill = cut)) +
      geom_density(position = "fill") + ggtitle("Conditional Density Estimate") +
      theme(plot.title = element_text(color = "black", size = 18, hjust = 0.5),
            axis.text.y = element_text(color = "blue", size = 12, hjust = 1),
            axis.text.x = element_text(color = "darkred", size = 12, hjust = .5, vjust = .5),
            axis.title.x = element_text(color = "red", size = 14),
            axis.title.y = element_text(size = 14))
    g4 <- ggplot(sampled_data, aes(carat, price)) + geom_boxplot() + facet_grid(. ~ cut) +
      ggtitle("Price by Carat for each cut") +
      theme(plot.title = element_text(color = "darkblue", size = 18, hjust = 0.5),
            axis.text.y = element_text(color = "blue", size = 12, hjust = 1),
            axis.text.x = element_text(color = "darkred", size = 12, hjust = .5, vjust = .5),
            axis.title.x = element_text(color = "red", size = 14),
            axis.title.y = element_text(size = 14))
    grid.arrange(g1, g2, g3, g4)
  })
}


To showcase these techniques, we will use one of the readily available datasets, the UCI_Dataset_StudentKnowledge.

We will download the dataset into the current RStudio environment:

library(readr)
StudentKnowledgeData <- read_csv("YourdownloadFolderPath/StudentKnowledgeData.csv")
View(StudentKnowledgeData)

This dataset is comma-separated with a header line. We will check for `NA` values and check whether there are any categorical variables, to ensure the data is ready for processing by the clustering algorithms. As this data set has a small number of features, we will not focus on feature selection aspects but will work with all the available features.

mydata <- StudentKnowledgeData
# let us analyze the data to find if categorical variables are there and,
# if so, transform them
mydata <- as.data.frame(unclass(mydata))
summary(mydata)
dim(mydata)
# We can now remove any records that have NAs
myDataClean <- na.omit(mydata)
dim(myDataClean)
summary(myDataClean)

# [1] 402   5
#       STG              SCG              STR              LPR
#  Min.   :0.0000   Min.   :0.0000   Min.   :0.0100   Min.   :0.0000
#  1st Qu.:0.2000   1st Qu.:0.2000   1st Qu.:0.2700   1st Qu.:0.2500
#  Median :0.3025   Median :0.3000   Median :0.4450   Median :0.3300
#  Mean   :0.3540   Mean   :0.3568   Mean   :0.4588   Mean   :0.4324
#  3rd Qu.:0.4800   3rd Qu.:0.5100   3rd Qu.:0.6800   3rd Qu.:0.6500
#  Max.   :0.9900   Max.   :0.9000   Max.   :0.9500   Max.   :0.9900
# (the PEG column of the summary was truncated in the original output)

Once pre-processing has ensured the data is ready for further applications, let us scale and center this dataset.

scaled_data <- as.matrix(scale(myDataClean))

Let us try to create the clusters for this data. As we can observe, this data does not have a pre-defined class/output type, so it becomes necessary to work out what an optimal number of clusters would be. Let us choose a random value for the number of clusters for now and see how the clusters are created.

Let us start with k = 3 and check the results. R comes with a built-in kmeans() function, so we do not need to import any extra package for this.

# Let us apply kmeans for k = 3 clusters
kmm <- kmeans(scaled_data, 3, nstart = 50, iter.max = 15)
# we keep iter.max = 15 to ensure the algorithm converges and nstart = 50 to
# ensure that at least 50 random starting sets are chosen
kmm

# K-means clustering with 3 clusters of sizes 93, 167, 142
#
# Cluster means:
#            STG        SCG        STR        LPR        PEG
# 1  0.573053974  0.3863411  0.2689915  1.3028712  0.1560779
# 2 -0.315847301 -0.4009366 -0.3931942 -0.1794893 -0.8332218
# 3 -0.003855777  0.2184978  0.2862481 -0.6421993  0.8776957
#
# Clustering vector: (omitted)
#
# Within cluster sum of squares by cluster:
# [1] 394.5076 524.4177 497.7787
#  (between_SS / total_SS = 29.3 %)
#
# Available components:
# [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
# [6] "betweenss" "size" "iter" "ifault"

When we check the (between_SS / total_SS) ratio, we find it to be low. This ratio accounts for the amount of the total sum of squares of the data points that lies between the clusters. We want to increase this value, and as we increase the number of clusters we see it increasing, but we do not want to overfit the data. In the extreme, with k = 402 (one cluster per observation for the 402 records) we would completely overfit the data. So the idea is to find a value of k for which the model is not overfitting while still clustering the data according to its actual distribution. Let us now look at how to find the best number of clusters.
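The behaviour described above can be traced directly by computing between_SS / total_SS for a range of k. A self-contained sketch on synthetic data (the post’s scaled_data would be used the same way):

```r
# Sketch: between_SS / total_SS rises as k grows (and would reach 1 when
# every observation gets its own cluster).
set.seed(123)
dat <- matrix(rnorm(200 * 3), ncol = 3)

ratio <- sapply(2:10, function(k){
  km <- kmeans(dat, centers = k, nstart = 25, iter.max = 15)
  km$betweenss / km$totss
})
round(ratio, 3)  # increases with k
```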

The elbow method looks at the percentage of variance explained as a function of the number of clusters: One should choose a number of clusters so that adding another cluster doesn’t give much better modeling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the “elbow criterion”. This “elbow” cannot always be unambiguously identified.

# Elbow Method for finding the optimal number of clusters
set.seed(123)
# Compute and plot wss for k = 1 to k = 15
k.max <- 15
data <- scaled_data
wss <- sapply(1:k.max, function(k){
  kmeans(data, k, nstart = 50, iter.max = 15)$tot.withinss
})
wss
plot(1:k.max, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")

# [1] 2005.0000 1635.8573 1416.7041 1253.9959 1115.4657 1026.0506  952.4835  887.7202
# [9]  830.8277  780.2121  735.6714  693.7745  657.0939  631.5901  608.3576

Beyond k = 4 the curve changes slowly compared with smaller values of k, so for this data k = 4 should be a good choice for the number of clusters; however, k = 5 also seems to be a potential candidate. So how do we decide which is the optimal choice? We look at a second approach, which comes with a new package.

The k-means model is “almost” a Gaussian mixture model and one can construct a likelihood for the Gaussian mixture model and thus also determine information criterion values.
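As a rough illustration of this idea, one can score a k-means fit with a BIC-style criterion under the simplifying assumption of spherical clusters with a shared variance. The function below is only a sketch of one common approximation; the name `kmeansBIC` and the exact formula are ours, not from any package:

```r
# Hedged sketch: a BIC-style score for a kmeans fit, assuming spherical
# Gaussian clusters with one shared variance (an approximation, not an
# exact mixture-model BIC).
kmeansBIC <- function(fit, data) {
  n <- nrow(data); d <- ncol(data); k <- nrow(fit$centers)
  ll   <- -0.5 * n * d * log(fit$tot.withinss / (n * d))  # approximate log-likelihood
  npar <- k * d + 1                                       # k centers plus one variance
  ll - 0.5 * npar * log(n)                                # higher is better here
}

# e.g. compare scores across a few values of k:
# sapply(2:6, function(k) kmeansBIC(kmeans(scaled_data, k, nstart = 50), scaled_data))
```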

We install the `mclust` package and use its `Mclust` function, which determines the optimal model and number of clusters according to the Bayesian Information Criterion (BIC) for expectation-maximization, initialized by hierarchical clustering for parameterized Gaussian mixture models. We set the `modelNames` parameter to `mclust.options("emModelNames")` so that only those models are evaluated for which the number of observations exceeds the dimensionality of the dataset (here 402 > 5). The best model selected was EVI (equal volume, variable shape, identity orientation) with 3 or 4 clusters. Based on this and the previous method, the natural choice is 4 clusters. To further validate this we also checked the BIC for k-means, and it confirms the findings of the `mclust` package: 3 and 4 clusters are the best choices, with the highest values, for this distribution of data.

d_clust <- Mclust(as.matrix(scaled_data), G = 1:15,
                  modelNames = mclust.options("emModelNames"))
d_clust$BIC
plot(d_clust)

Bayesian Information Criterion (BIC):
        EII       VII       EEI       VEI       EVI       VVI       EEE       EVE
1 -5735.105 -5735.105 -5759.091 -5759.091 -5759.091 -5759.091 -5758.712 -5758.712
2 -5731.019 -5719.188 -5702.988 -5635.324 -5725.379 -5729.256 -5698.095 -5707.733
3 -5726.577 -5707.840 -5648.033 -5618.274 -5580.305 -5620.816 -5693.977 -5632.555
..................................................................................
        VEE       VVE       EEV       VEV       EVV       VVV
1 -5758.712 -5758.712 -5758.712 -5758.712 -5758.712 -5758.712
2 -5704.051 -5735.383 -5742.110 -5743.216 -5752.709 -5753.597
3 -5682.312 -5642.217 -5736.306 -5703.742 -5717.796 -5760.915
..............................................................

Top 3 models based on the BIC criterion:
    EVI,3     EVI,4     EEI,5
-5580.305 -5607.980 -5613.077

> plot(d_clust)
Model-based clustering plots:
1: BIC
2: classification
3: uncertainty
4: density
Selection: 1

The plot can be seen below where k=3 and k=4 are the best choices available.

As we can see, with these two approaches we can be fairly confident about an optimal value for the number of clusters in a clustering problem. A few other techniques can also be used; let us take a look at one such approach using the `NbClust` package.

The `NbClust` package provides 30 indices for determining the number of clusters and proposes to the user the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.

install.packages("NbClust", dependencies = TRUE)
library(NbClust)
nb <- NbClust(scaled_data, diss = NULL, distance = "euclidean",
              min.nc = 2, max.nc = 5, method = "kmeans",
              index = "all", alphaBeale = 0.1)
hist(nb$Best.nc[1,], breaks = max(na.omit(nb$Best.nc[1,])))

An important point here is that this method always goes with the majority vote of the indices for each cluster size. So it is important to understand which of the indices are relevant to the data, and based on that to determine whether the best choice is the value suggested by the majority or some other value.
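To see how the votes are distributed, rather than only taking the overall winner, one can tabulate the first row of `Best.nc` returned by `NbClust`. A hedged sketch, assuming the `nb` object from the call above:

```r
# Tabulate how many indices voted for each number of clusters
votes <- table(na.omit(nb$Best.nc[1, ]))
votes

# the majority choice, which may or may not be right for your data
as.integer(names(which.max(votes)))
```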

As we see below, looking at the second-differences D-index graph, it is quite clear that the best number of clusters is k=4.

I hope this gives some insight into how to use different resources in R to determine the optimal number of clusters for relocation algorithms like k-means or EM.

Related Post


- Logistic Regression Regularized with Optimization
- Analytical and Numerical Solutions to Linear Regression Problems
- Regression model with auto correlated errors – Part 3, some astrology
- Regression model with auto correlated errors – Part 2, the models
- Regression model with auto correlated errors – Part 1, the data

So the models will look something like this (dx are the dependent and ix the independent variables; v are other covariates):

dx1 = ix1 + v1 + v2 + v3
dx1 = ix2 + v1 + v2 + v3
dx1 = ix3 + v1 + v2 + v3
...
dx1 = ix100 + v1 + v2 + v3
dx2 = ix1 + v1 + v2 + v3
...
dx100 = ix100 + v1 + v2 + v3

The output should be a data frame with five columns: dependent variable, independent variable, beta estimate, standard error, and p-value.

Something like this (those numbers are just for illustration purposes):

    d   i beta    se pvalue
1 dx1 ix1  0.1 0.002  0.950
2 dx2 ix2  0.2 0.002  0.826
3 dx3 ix3  0.3 0.005  0.123

OK, now let's begin. The dataset I received had all the variables in columns and observations in rows (the data is not real, just random numbers for illustration purposes):

id dx1 dx2 ... dx100 ix1 ... ix100 v1 v2 v3
10 324 124 ...   214  32 ...    32 ax b4 c3
11 431 982 ...   114  12 ...    77 ce b2 c5
12 545 123 ...   104  34 ...    11 ar c2 a5
...

Create vectors for the position of the dependent and independent variables in your dataset.

# outcome (dependent) variables: columns 2 to 101
out_start=2
out_end=101
out_nvar=out_end-out_start+1
out_variable=rep(NA, out_nvar)
out_beta=rep(NA, out_nvar)
out_se=rep(NA, out_nvar)
out_pvalue=rep(NA, out_nvar)

# exposure (independent) variables: columns 102 to 201
exp_start=102
exp_end=201
exp_nvar=exp_end-exp_start+1
exp_variable=rep(NA, exp_nvar)
exp_beta=rep(NA, exp_nvar)
exp_se=rep(NA, exp_nvar)
exp_pvalue=rep(NA, exp_nvar)

number=1

I used a linear mixed effects model, and therefore I loaded the `lme4` library. The loop should work with other regression analyses (e.g. linear regression) if you modify it according to your regression model. If you don't know which part to modify, leave a comment below and I will try to help.
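For example, with ordinary linear regression the model fit inside the loop might look like the following. This is only a sketch of the part you would swap in: `lm` has no `fixef()`, so the coefficient, standard error, and p-value come straight from `summary()`.

```r
# Hypothetical replacement for the lmer() call inside the loop: a plain
# linear model with v1-v3 as fixed covariates.
model <- lm(get(outcome) ~ get(exposure) + v1 + v2 + v3,
            na.action = na.exclude, data = dat)
cf   <- summary(model)$coefficients
beta <- cf[2, "Estimate"]      # row 2 is the exposure term
se   <- cf[2, "Std. Error"]
pval <- cf[2, "Pr(>|t|)"]
```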

Like other loops, this one calls the variables of interest one by one and, for each of them, extracts and stores the beta, standard error, and p-value. Remember, this code is specific to linear mixed effects models.

library(lme4)

for (i in out_start:out_end) {
  outcome = colnames(dat)[i]
  for (j in exp_start:exp_end) {
    exposure = colnames(dat)[j]
    model <- lmer(get(outcome) ~ get(exposure) + v1 + (1|v2) + (1|v3),
                  na.action = na.exclude, data = dat)
    Vcov <- vcov(model, useScale = FALSE)
    beta <- fixef(model)
    se   <- sqrt(diag(Vcov))
    zval <- beta / se
    pval <- 2 * pnorm(abs(zval), lower.tail = FALSE)
    # store the estimates for the exposure term (second coefficient)
    out_beta[number]     = as.numeric(beta[2])
    out_se[number]       = as.numeric(se[2])
    out_pvalue[number]   = as.numeric(pval[2])
    out_variable[number] = outcome
    number = number + 1
    exp_beta[number]     = as.numeric(beta[2])
    exp_se[number]       = as.numeric(se[2])
    exp_pvalue[number]   = as.numeric(pval[2])
    exp_variable[number] = exposure
    number = number + 1
  }
}

Create a dataframe with results:

outcome  = data.frame(out_variable, out_beta, out_se, out_pvalue)
exposure = data.frame(exp_variable, exp_beta, exp_se, exp_pvalue)

We now have two data frames with our results and need to combine them into one. With the help of the `tidyverse` package this is a simple task: we rename the variables to common names and then merge both data frames together.

library(tidyverse)

outcome = outcome %>%
  rename(
    variable = out_variable,
    beta     = out_beta,
    se       = out_se,
    pvalue   = out_pvalue
  )

exposure = exposure %>%
  rename(
    variable = exp_variable,
    beta     = exp_beta,
    se       = exp_se,
    pvalue   = exp_pvalue
  )

all = rbind(outcome, exposure)
all = na.omit(all)
head(all)

  variable beta    se pvalue
1      dx1  0.1 0.002  0.950
3      dx2  0.2 0.002  0.826
........
2      ix1  0.1 0.002  0.950
4      ix2  0.2 0.002  0.826
........

Yet this is not the data frame we are looking for: we need both the dependent and the independent variable in one row. Therefore, we do the final transformation as follows:

data = all %>%
  mutate(type = substr(variable, 1, 2)) %>%
  spread(type, variable) %>%
  rename(d = dx, i = ix) %>%
  mutate(
    beta   = round(beta, 5),
    se     = round(se, 5),
    pvalue = round(pvalue, 5)
  ) %>%
  select(d, i, beta, se, pvalue)
head(data)

    d   i beta    se pvalue
1 dx1 ix1  0.1 0.002  0.950
2 dx2 ix2  0.2 0.002  0.826
3 dx3 ix3  0.3 0.005  0.123

I hope you find this post useful for your research and data analysis!



At the time of this writing, installing `ggplot2` and `ggmap` from CRAN will result in an error. To avoid the error, be sure to install both packages from GitHub with the `devtools` package, and restart R if the problem persists.

devtools::install_github("dkahle/ggmap")
devtools::install_github("hadley/ggplot2")

Note, at the time of this writing the package `gganimate` requires the package `cowplot`. Be sure to install and load the package before continuing.

install.packages("cowplot")
library(cowplot)

In Part 1 and Part 2 of this series we made maps of metro systems across four European cities, and then computed Delaunay triangulations and centroids for each city. In the third and final part, we’ll do these same steps but at multiple time points to make a .gif of how metro systems change over time.

As a reminder, our data is the hand corrected values of the data we pulled down from Google. To see how we got the data go back to Part 1: Data.

We now have a good sense of what each city's current metro system looks like, but how did these systems come to be this way? Now we'll look at how these systems have changed and grown over time; that's why at the beginning we made a column for `opened_year`. At this point the code gets less elegant, but we'll go through it step by step. It's all the same principles as when we made our figures earlier.

The main idea of the following code is that we're going to create unique triangulations for each year within each city. As more metro stations get added each year, the triangulation will change. Just as we had `data_deldir_delsgs` and `data_deldir_cent`, we're going to start by creating two empty data frames, `time_deldir_delsgs` and `time_deldir_sum` (remember that our centroid data frame was based on the summary data). With our empty data frames initialized we can make a `for` loop. We want to go through each year, but for each city separately, so our first `for` loop goes through each city, filtering our data to only the city in question. Next we have our second `for` loop going through each year, starting with the minimum year in the data for that city and going up to 2015, the maximum year for the full data set. For a given year we `filter()` to include only metro stops that were opened that year or earlier. We use "equal to or less than" because we don't want to ignore metro stops from earlier years; we want the whole metro system as it exists for a given year. Note, we need at least three points to make a triangle, and you may think that a city wouldn't ever have only one or two metro stops, but you would be wrong (*cough* Barcelona *cough*), so we're going to put in a stopgap saying that if the number of data points is less than three, the loop should skip that year and move on to the next one.

Okay, assuming there are at least three data points, we're going to run the `deldir()` call and save the result in `year_deldir`. Then we create two new data frames. The first is `year_deldir_delsgs`, which contains the `delsgs` information from `deldir`. We're going to add two columns too, `city` and `opened_year`, so we know which city and year this data comes from. We then add this information to our existing `time_deldir_delsgs` data frame with a `bind_rows()` call. We do the same thing to create `year_deldir_sum`, only we pull out the `summary` information from `year_deldir` instead of the `delsgs` information. We also add our `city` and `opened_year` columns and then `bind_rows()` it with `time_deldir_sum`. The loop does this for every city, from the minimum year in the data up to 2015. See below the head of the two data frames we created.

time_deldir_delsgs = data.frame()
time_deldir_sum = data.frame()

for(c in c("Paris", "Berlin", "Barcelona", "Prague")) {
  data_city = filter(data, city == c)

  for(year in min(data_city$opened_year):2015) {
    data_year = filter(data_city, opened_year <= year)

    # Skip if the number of stops is less than 3
    if(dim(data_year)[1] < 3) next

    year_deldir = deldir(data_year$lon, data_year$lat)

    year_deldir_delsgs = year_deldir$delsgs %>%
      mutate(city = c) %>%
      mutate(opened_year = year)
    time_deldir_delsgs = bind_rows(time_deldir_delsgs, year_deldir_delsgs)

    year_deldir_sum = year_deldir$summary %>%
      mutate(city = c) %>%
      mutate(opened_year = year)
    time_deldir_sum = bind_rows(time_deldir_sum, year_deldir_sum)
  }
}

head(time_deldir_delsgs)
head(time_deldir_sum)

        x1       y1       x2       y2 ind1 ind2  city opened_year
1 2.358279 48.85343 2.369510 48.85302   11    2 Paris        1900
2 2.374377 48.84430 2.369510 48.85302    9    2 Paris        1900
3 2.374377 48.84430 2.358279 48.85343    9   11 Paris        1900
4 2.414484 48.84640 2.369510 48.85302   16    2 Paris        1900
5 2.414484 48.84640 2.374377 48.84430   16    9 Paris        1900
6 2.386581 48.84732 2.369510 48.85302   18    2 Paris        1900

         x        y n.tri del.area  del.wts n.tside nbpt dir.area  dir.wts  city opened_year
1 2.289364 48.87577     4  1.9e-05 0.012510       4    2 0.000066 0.009329 Paris        1900
2 2.369510 48.85302     5  7.0e-05 0.047518       4    2 0.000520 0.073787 Paris        1900
3 2.290121 48.86685     5  3.7e-05 0.024907       5    0 0.000074 0.010541 Paris        1900
4 2.313310 48.86750     4  3.7e-05 0.024692       3    4 0.000388 0.054946 Paris        1900
5 2.294900 48.87393     5  1.6e-05 0.010551       3    2 0.000061 0.008594 Paris        1900
6 2.347301 48.85869     4  1.1e-05 0.007359       3    4 0.000392 0.055588 Paris        1900

As you may recall, though, we're not necessarily interested in all the summary information; we just want it to compute our centroid. So we make a new data frame, `time_deldir_cent`. The code is the same as our earlier code for computing centroids; the only difference is that we'll also group by `opened_year`, not just `city`, since we want unique centroids for each year for each city. See part of the data frame of the centroids below.

time_deldir_cent = time_deldir_sum %>%
  group_by(city, opened_year) %>%
  summarise(cent_x = sum(x * del.wts),
            cent_y = sum(y * del.wts)) %>%
  ungroup()

head(time_deldir_cent)

# A tibble: 6 × 4
       city opened_year   cent_x   cent_y
      <chr>       <int>    <dbl>    <dbl>
1 Barcelona        1924 2.116839 41.38132
2 Barcelona        1925 2.120019 41.37782
3 Barcelona        1926 2.121921 41.37834
4 Barcelona        1927 2.121921 41.37834
5 Barcelona        1928 2.122325 41.37384
6 Barcelona        1929 2.113628 41.37543

There's still one more thing I want to do before we make our figures. Right now the figures will have different start dates depending on when the first metro stop was built in a given city. Instead, I want all figures to start at the same year so we see them change over time from the same start date for each city. To do this we'll make a new data frame called `years` that simply lists the years 1900 to 2015 four times, once for each city. We then do a `left_join()` with our data. As a result, any time the `opened_year` in question is not found in the data frame for a given city, an empty row will be added, empty except for the `opened_year` and `city` values. You'll also notice that I `filter()`ed to only include decade years (1900, 1910, 1920, etc.) and the year 2015, so that the last year of our data is included. This is because if we include every year our gif will be very large and non-portable. Also, it's more dramatic to see changes every 10 years.

years = data.frame(opened_year = rep(seq(1900, 2015), 4),
                   city = c(rep("Paris", 116), rep("Berlin", 116),
                            rep("Barcelona", 116), rep("Prague", 116)))

data_time = left_join(years, data) %>%
  mutate(opened_by_year = ifelse(opened_year %% 10 == 0, opened_year,
                                 opened_year + (10 - (opened_year %% 10)))) %>%
  filter(opened_by_year <= 2015)

time_deldir_delsgs_sub = time_deldir_delsgs %>%
  filter(opened_year %% 10 == 0 | opened_year == 2015)

time_deldir_cent_sub = time_deldir_cent %>%
  filter(opened_year %% 10 == 0 | opened_year == 2015)

I kept saying we were going to make maps showing the change over time, but how are we going to do that? Instead of building a single static plot for each city, we're going to build an animation where, as the year changes, so does the map. To do this we'll use the package `gganimate`, which works on top of `ggplot2` (which is useful since we're already using `ggmap`, which also works on top of `ggplot2`). We build our plot just as we would any other `ggplot2` figure, but for the data we want to add the `frame` setting. The `frame` is the thing in the plot that changes, in our case `opened_year`. Also, while we only want to plot the triangulations and centroids specific to a given year, we want the points for the metro stops to be additive. For example, when `frame` is 2000 we still want the points from 1990 to be plotted. To do this we add `cumulative = TRUE` to the call for those points. Finally, since we updated our data to include empty rows so that all plots start in 1900, all plots will have a frame starting at 1900 even if there are no data points to plot. I've again made a function to make our plots. See below for the code for the Paris map as well as all four animations. Also, notice that in 1920 (actually 1912) Barcelona gets its first metro stop…but doesn't get any more until 1930 (actually 1924). Take a look to see if you can find any other interesting things about how the systems changed over time.

devtools::install_github("dgrtwo/gganimate")
library(gganimate)

time_plot = function(city_name, city_map){
  ggmap(city_map, extent = "device") +
    geom_segment(data = subset(time_deldir_delsgs_sub, city == city_name),
                 aes(x = x1, y = y1, xend = x2, yend = y2, frame = opened_year),
                 size = 1, color = "#92c5de") +
    geom_point(data = subset(data_time, city == city_name),
               aes(x = lon, y = lat, frame = opened_by_year, cumulative = TRUE),
               color = "#0571b0", size = 3) +
    geom_point(data = subset(time_deldir_cent_sub, city == city_name),
               aes(x = cent_x, y = cent_y, frame = opened_year),
               size = 6, color = "#ca0020") +
    theme(plot.title = element_text(hjust = 0.5))
}

paris_time.plot = time_plot("Paris", paris_map)
gganimate(paris_time.plot)

In these three posts we looked at how the metro systems of four European cities changed over time. To do this we used a lot of different packages: `dplyr`, `tidyr`, `purrr`, and `ggplot2`, which are all now part of the `tidyverse`. We used two other plotting packages that build upon `ggplot2`, `ggmap` and `gganimate`. Finally, we used the `deldir` package to make Delaunay triangulations and compute the centroids of city metro systems over time. All of these skills can be applied to any other type of spatial data with unique shapes, and can be used to make your very own gifs. Try your city as a practice exercise!
