
On the other hand, you may want to get a basic understanding of stock price time series forecasting from a simple model that still provides sufficient reliability. For that purpose, the Black-Scholes-Merton model, which is based on the lognormal distribution hypothesis and widely used in financial analysis, can be helpful. The rest of this short article shows how to take advantage of it.

As mentioned, I am going to introduce the Black-Scholes-Merton model, which assumes that percentage changes in the stock price over a short period of time are normally distributed.

Under these hypotheses, the return of the stock price **S(t)** at time t can be expressed as:

$$

\begin{equation}

\begin{aligned}

\frac{S(t)-S(t_{0})}{S(t_{0})}\ \sim\ N(u\ \Delta T,\ \sigma^2\Delta T) \ \ \ \ \ (1) \\

\end{aligned}

\end{equation}

$$

where the left-hand term is the (discrete) return on the stock price S at time t. By reformulating the same relationship in terms of first-order differentials of price (dS) and time (dt), equation (1) can be expressed in terms of continuously compounded returns, obtaining:

$$

\begin{equation}

\begin{cases}

t = t_{0} + \Delta T \\

ln(\frac{S(t)}{S(t_{0})})\ \sim \ N((u\ -\ \frac{\sigma^2}{2})\ \Delta T,\ \sigma^2\Delta T) \ \ \ \ \ (2) \\

\end{cases}

\end{equation}

$$
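Before moving on, a quick numerical check (with made-up prices) shows why the discrete return of equation (1) and the log return of equation (2) nearly coincide for small daily moves:

```r
# discrete vs continuously compounded return for a small daily move
S0 <- 100                      # hypothetical price at t0
S1 <- 101                      # hypothetical price at t
r_simple <- (S1 - S0) / S0     # discrete return, as in equation (1)
r_log    <- log(S1 / S0)       # log return, as in equation (2)
c(r_simple, r_log)             # 0.01 vs ~0.00995: nearly equal
```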

Equation (2) states that the log returns follow a normal distribution, where **u** is the stock price drift and **σ** is the stock price standard deviation. From equation (2), the following relationship binding the stock price S(t) to the stock price S(t0) can be stated:

$$

\begin{equation}

\begin{aligned}

S(t)\ =\ S(t_{0})\ e^{(u\ -\ \frac{\sigma^2}{2})\ \Delta T\ +\ \sigma B(t)} \ \ \ \ \ (3) \\

\end{aligned}

\end{equation}

$$

The **B(t)** term represents a standard Brownian motion.
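Equation (3) can be simulated directly. The sketch below (with made-up drift and volatility values) generates one possible daily price path by sampling the Brownian motion as a cumulative sum of standard normal increments:

```r
set.seed(42)
S0 <- 100; u <- 0.001; sigma <- 0.02   # hypothetical daily drift and volatility
n <- 250                               # roughly one year of trading days
B <- cumsum(rnorm(n))                  # standard Brownian motion sampled daily
t <- 1:n
S <- S0 * exp((u - sigma^2/2) * t + sigma * B)  # equation (3)
head(S, 3)
```

Re-running without the seed gives a different path each time, which is exactly the randomness the confidence bounds below are meant to bracket.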

Furthermore, equation (2) allows us to determine the distribution of the stock price, as stated by the following equation:

$$

\begin{equation}

\begin{aligned}

\ln(S(t_{0}+\Delta T))\ \sim \ N(ln(S(t_{0})) + (u\ -\ \frac{\sigma^2}{2})\Delta T,\ \sigma^2\Delta T)\ \ \ \ \ (4) \\

\end{aligned}

\end{equation}

$$

The drift **u** and the standard deviation **σ** can be estimated from the stock price time series history.
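As a sketch of that estimation step, run on simulated rather than downloaded returns so that the true parameters are known (the 0.0005 and 0.02 figures are made up):

```r
set.seed(7)
ret <- rnorm(250, mean = 0.0005, sd = 0.02)  # pretend daily log returns
u_hat  <- mean(ret)   # drift estimate
sd_hat <- sd(ret)     # volatility estimate
c(u_hat, sd_hat)      # close to the true 0.0005 and 0.02
```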

The time interval **ΔT** represents our future horizon. Please note that both the drift and the standard deviation must refer to the same time scale: if **ΔT** is measured in days, we have to use daily returns and the daily standard deviation; if **ΔT** is measured in weeks, weekly returns and the weekly standard deviation; and so on.
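Because the model assumes independent increments, variance grows linearly with time, so a volatility estimated on one time scale can be rescaled to another with the square-root-of-time rule. A quick sketch, using a made-up daily figure:

```r
sigma_daily  <- 0.0189                 # hypothetical daily standard deviation
sigma_weekly <- sigma_daily * sqrt(5)  # 5 trading days per week
sigma_weekly
```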

Taking the exponential of both sides of equation (4), we obtain:

$$

\begin{equation}

\begin{aligned}

S(t_{0} + \Delta T)\ \sim \ exp(\ N(ln(S(t_{0})) + (u\ -\ \frac{\sigma^2}{2})\ \Delta T,\ \sigma^2 \Delta T)\ ) \\

= exp(\ N(\ \hat u(\Delta T),\ \hat\sigma^2(\Delta T))\ ) \ \ \ \ \ (5) \\

\end{aligned}

\end{equation}

$$

The above equation provides a family of normal distributions with known parameters, indexed by the time interval **ΔT** ∈ [0, T]. Lower and upper bounds at each instant of time t can be modeled as a 95% confidence interval, determined by the 1.96 multiple of the standard deviation at time t. As a result, the expected value and the lower and upper bounds of the stock price S(t) are determined as follows:

$$

\begin{equation}

\begin{cases}

E(S(t))\ =\ exp(ln(S(t_{0})) + (u\ -\ \frac{\sigma^2}{2})\ \Delta T) \ \ \ \ \ (6) \\

LB(S(t))\ = exp(ln(S(t_{0})) + (u\ -\ \frac{\sigma^2}{2})\ \Delta T\ -\ 1.96\ \sigma \sqrt{\Delta T}) \\

UB(S(t))\ = exp(ln(S(t_{0})) + (u\ -\ \frac{\sigma^2}{2})\ \Delta T\ +\ 1.96\ \sigma \sqrt{\Delta T}) \\

\end{cases}

\end{equation}

$$
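As a numerical sketch of equation (6), with made-up parameters (a hypothetical starting price of 100, daily drift 0.001, daily volatility 0.02, and a 10-day horizon):

```r
S0 <- 100; u <- 0.001; sigma <- 0.02; dT <- 10   # hypothetical values
m    <- log(S0) + (u - sigma^2/2) * dT           # mean of the log price
E_S  <- exp(m)                                   # expected value
LB_S <- exp(m - 1.96 * sigma * sqrt(dT))         # lower 95% bound
UB_S <- exp(m + 1.96 * sigma * sqrt(dT))         # upper 95% bound
c(LB_S, E_S, UB_S)
```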

I will take advantage of the *timeSeries* package to compute returns (see the returns() function) and the *quantmod* package to fetch financial stock prices (see the getSymbols() function). Specifically, I will download the Apple share price history.

```r
suppressPackageStartupMessages(library(timeSeries))
suppressPackageStartupMessages(library(quantmod))

# downloading stock price
getSymbols("AAPL")
head(AAPL)
```

```
           AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2007-01-03     86.29     86.58    81.90      83.80   309579900      10.85709
2007-01-04     84.05     85.95    83.82      85.66   211815100      11.09807
2007-01-05     85.77     86.20    84.40      85.05   208685400      11.01904
2007-01-08     85.96     86.53    85.28      85.47   199276700      11.07345
2007-01-09     86.45     92.98    85.15      92.57   837324600      11.99333
2007-01-10     94.75     97.80    93.45      97.00   738220000      12.56728
```

```r
# using adjusted close price
Y <- coredata(AAPL[,"AAPL.Adjusted"])

# history time span
hist_start <- 1
hist_end <- 100
hist <- c(hist_start:hist_end)

# historical prices
Y.hist <- Y[hist]

# historical returns = (Y(t1)-Y(t0))/Y(t0)
Y.hist.ret <- returns(Y.hist)

# standard deviation computed based on history
(sv_hist <- sd(Y.hist.ret, na.rm=T))
```

```
[1] 0.01886924
```

The value shown above is the estimate of the standard deviation of our share price daily returns, as determined from historical observations.

It is good practice to compute confidence intervals for the estimated parameters, in order to understand whether the sample size provides sufficient precision.

```r
# 95% confidence interval
n <- length(hist)
sv_hist_low <- sqrt((n-1)*sv_hist^2/qchisq(.975, df=n-1))
sv_hist_up  <- sqrt((n-1)*sv_hist^2/qchisq(.025, df=n-1))
(sv_hist_confint_95 <- c(sv_hist_low, sv_hist, sv_hist_up))
```

```
[1] 0.01656732 0.01886924 0.02191993
```

I am going to show a plot outlining the future share price evolution.

```r
# relative future time horizon
t <- 1:20

# martingale hypothesis (the average of future returns is equal to the current value)
u <- 0

# future expected value as based on normal distribution hypothesis
fc <- log(Y.hist[hist_end]) + (u - 0.5*sv_hist^2)*t

# lower bound 95% confidence interval
fc.lb <- fc - 1.96*sv_hist*sqrt(t)

# upper bound 95% confidence interval
fc.ub <- fc + 1.96*sv_hist*sqrt(t)

# collecting lower, expected and upper values
fc_band <- list(lb = exp(fc.lb), m = exp(fc), ub = exp(fc.ub))

# absolute future time horizon
xt <- c(hist_end + t)

# stock price history line plot
plot(Y[hist_start:(hist_end + max(t))], type='l',
     xlim = c(0, hist_end + max(t)), ylim = c(5, 20),
     xlab = "Time Index", ylab = "Share Price", panel.first = grid())

# starting point for our forecast
suppressWarnings(points(x = hist_end, y = Y.hist[hist_start + hist_end - 1],
                        pch = 21, bg = "green"))

# lower bound stock price forecast
lines(x = xt, y = fc_band$lb, lty = 'dotted', col = 'violet', lwd = 2)

# expected stock price forecast
lines(x = xt, y = fc_band$m, lty = 'dotted', col = 'blue', lwd = 2)

# upper bound stock price forecast
lines(x = xt, y = fc_band$ub, lty = 'dotted', col = 'red', lwd = 2)
```

The plot shows the lower (violet) and upper (red) bounds enclosing the actual future price evolution, together with the forecast expected value (blue). I did not account for a drift **u** (u = 0), and as a consequence the future expected value is nearly a flat line (its slope is actually slightly negative, as determined by the -0.5σ² term).
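That slightly negative slope is easy to quantify: with u = 0, the expected log price decreases by σ²/2 per unit of time. Using the daily volatility estimated above (≈ 0.0189):

```r
sv <- 0.0189                      # daily standard deviation (from the history above)
drift_per_day <- 0 - 0.5 * sv^2   # slope of the expected log price under u = 0
drift_per_day                     # about -0.00018 per day: nearly flat
```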

If you would like the future stock price drift to be more consistent with its recent history, you may compute a mean return based on the same time scale as the standard deviation.

The lines of code to add are the following:

```r
# added line of code
(mu_hist <- mean(Y.hist.ret, na.rm=T))
n <- length(hist)

# 95% confidence interval for the mean
mu_hist_low <- mu_hist - qt(0.975, df=n-1)*sv_hist/sqrt(n)
mu_hist_up  <- mu_hist + qt(0.975, df=n-1)*sv_hist/sqrt(n)
(mu_hist_confint_95 <- c(mu_hist_low, mu_hist, mu_hist_up))
```

```
[1] -0.0006690514  0.0030750148  0.0068190811
```

Above, the confidence interval for the historical daily returns is shown. We also have to change the assignment of the drift variable u.

```r
# drift computed on historical values
(u <- mu_hist)
```

```
[1] 0.0030750148
```

Furthermore, the code shown above can be easily enhanced with sliders to specify the stock price history to take into account for parameters estimation and the desired future time horizon. That can be done by taking advantage of the manipulate package or by implementing a Shiny gadget or application, for example.

Using the same model, it is possible to compute the probability that the future stock price will be above or below a predetermined value at time t. This is done by computing the normal distribution parameters as a function of ΔT = t − t0 and applying a basic property of the density distribution. Here is how.

This is the historical share price daily drift **u**:

```r
(u <- mu_hist)
```

```
[1] 0.003075015
```

This is the current share price **S(t0)**:

```r
(curr_share_price <- Y.hist[hist_end])
```

```
[1] 14.72056
```

This is the mean **mu_t** of our normal distribution computed with ΔT= 10, ten units of time (days) ahead of the current time:

```r
t <- 10
# mean of the log price at t0 + ΔT, per equation (4)
(mu_t <- log(curr_share_price) + (u - 0.5*sv_hist^2)*t)
```

This is the standard deviation **sv_t** of our normal distribution computed with ΔT = 10, ten units of time (days) ahead of the current time:

```r
(sv_t <- sv_hist*sqrt(t))
```

```
[1] 0.05966977
```

Arbitrarily, I set a target price 10% higher than the current one, hence equal to:

```r
(target_share_price <- 1.10*curr_share_price)
```

```
[1] 16.19261
```

The probability that the share price at time t is equal to or greater than the target price (please note the lower.tail = FALSE parameter) is the following:

```r
pnorm(q = log(target_share_price), mean = mu_t, sd = sv_t, lower.tail = FALSE)
```

```
[1] 0.06072166
```

Our model states there is a probability of about 6% that the share price will be at or above the target price.

**The Misbehavior of Markets: criticism to the lognormal hypothesis**

In 1963, Mandelbrot published research highlighting that, contrary to the general assumption that share price movements were normally distributed, they instead follow a Pareto-Levy distribution, which has infinite variance. This implies that values assigned negligible probability under the normal distribution hypothesis are actually not that unlikely to happen under a Pareto-Levy distribution.

On that basis, market booms and crashes are more frequent than we may think. Be aware of this when applying lognormal distribution assumptions.
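A quick sketch makes the point. Since R has no built-in Pareto-Levy density, a heavy-tailed Student-t distribution serves here as a convenient stand-in for a fat-tailed law, compared against the normal:

```r
# probability of a move below -5 (in distribution units)
p_norm <- pnorm(-5)        # normal tail: about 3e-7
p_t    <- pt(-5, df = 3)   # heavy-tailed t(3) tail: about 0.008
p_t / p_norm               # extreme moves are orders of magnitude more likely
```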

We have outlined an approach that can be easily implemented to compute expected values, lower and upper bounds of future stock prices. That is based on the well known Black-Scholes-Merton model and its normal distribution hypothesis.

The plot showing future share price expected value, lower and upper bounds can be enhanced with interactive inputs to allow users to select history length and future horizon of interest. By using the same model, probabilities associated to future price thresholds can be computed.

A reference for the Black-Scholes-Merton model we talked about can be found in the following book:

- Options, Futures and Other Derivatives, John C. Hull, Prentice Hall, 8th Ed.

A reference for Mandelbrot's criticism of the share price lognormal distribution hypothesis is the following book:

- The Misbehavior of Markets: A Fractal View of Financial Turbulence, Benoit Mandelbrot & Richard L. Hudson, Basic Books.

**Disclaimer**

Any securities or databases referred to in this post are solely for illustration purposes, and under no circumstances should the findings presented here be interpreted as investment advice or a promotion of any particular security or source.

If you have any question, feel free to comment below.


To use MongoDB with R, we first have to download and install MongoDB.

Next, start MongoDB. We can start MongoDB like so:

```
mongod
```

We can use the mongolite package, a fast and simple MongoDB client for R, to use MongoDB with R.

Let’s insert the crimes data from data.gov into MongoDB. The dataset reflects reported incidents of crime (with the exception of murders, where data exists for each victim) that have occurred in the City of Chicago since 2001.

```r
library(ggplot2)
library(dplyr)
library(maps)
library(ggmap)
library(mongolite)
library(lubridate)
library(gridExtra)

crimes = data.table::fread("Crimes_2001_to_present.csv")
names(crimes)
```

```
'ID' 'Case Number' 'Date' 'Block' 'IUCR' 'Primary Type' 'Description'
'Location Description' 'Arrest' 'Domestic' 'Beat' 'District' 'Ward'
'Community Area' 'FBI Code' 'X Coordinate' 'Y Coordinate' 'Year'
'Updated On' 'Latitude' 'Longitude' 'Location'
```

Let’s remove spaces in the column names to avoid any problems when we query it from MongoDB.

```r
names(crimes) = gsub(" ", "", names(crimes))
names(crimes)
```

```
'ID' 'CaseNumber' 'Date' 'Block' 'IUCR' 'PrimaryType' 'Description'
'LocationDescription' 'Arrest' 'Domestic' 'Beat' 'District' 'Ward'
'CommunityArea' 'FBICode' 'XCoordinate' 'YCoordinate' 'Year'
'UpdatedOn' 'Latitude' 'Longitude' 'Location'
```

We can use the insert function from the mongolite package to insert rows into a collection in MongoDB.

Let’s create a database called Chicago and call the collection crimes.

```r
# create connection, database and collection
my_collection = mongo(collection = "crimes", db = "Chicago")
my_collection$insert(crimes)
```

Let’s check if we have inserted the “crimes” data.

```r
my_collection$count()
```

```
6261148
```

We see that the collection has 6261148 records.

First, let’s look what the data looks like by displaying one record:

```r
my_collection$iterate()$one()
```

```
$ID                  1454164
$CaseNumber          'G185744'
$Date                '04/01/2001 06:00:00 PM'
$Block               '049XX N MENARD AV'
$IUCR                '0910'
$PrimaryType         'MOTOR VEHICLE THEFT'
$Description         'AUTOMOBILE'
$LocationDescription 'STREET'
$Arrest              'false'
$Domestic            'false'
$Beat                1622
$District            16
$FBICode             '07'
$XCoordinate         1136545
$YCoordinate         1932203
$Year                2001
$UpdatedOn           '08/17/2015 03:03:40 PM'
$Latitude            41.970129962
$Longitude           -87.773302309
$Location            '(41.970129962, -87.773302309)'
```

How many distinct “Primary Type” do we have?

```r
length(my_collection$distinct("PrimaryType"))
```

```
35
```

As shown above, there are 35 different crime primary types in the database. We will see the patterns of the most common crime types below.

From the *iterate()$one()* command above, we have seen that one of the columns is Domestic, which shows whether or not the crime was domestic.

```r
my_collection$count('{"PrimaryType" : "ASSAULT", "Domestic" : "true" }')
```

```
82470
```

We can get the filtered data, and we can also retrieve only the columns of interest:

```r
query1 = my_collection$find('{"PrimaryType" : "ASSAULT", "Domestic" : "true" }')
query2 = my_collection$find('{"PrimaryType" : "ASSAULT", "Domestic" : "true" }',
                            fields = '{"_id":0, "PrimaryType":1, "Domestic":1}')
ncol(query1)  # with all the columns
ncol(query2)  # only the selected columns
```

```
22
2
```

```r
my_collection$aggregate('[{"$group":{"_id":"$LocationDescription", "Count": {"$sum":1}}}]') %>%
  na.omit() %>%
  arrange(desc(Count)) %>% head(10) %>%
  ggplot(aes(x=reorder(`_id`, Count), y=Count)) +
  geom_bar(stat="identity", color='skyblue', fill='#b35900') +
  geom_text(aes(label = Count), color = "blue") +
  coord_flip() + xlab("Location Description")
```

If loading the entire dataset does not slow down our analysis, we can use data.table or dplyr, but when dealing with big data, MongoDB can give us a performance boost, as the whole dataset is not loaded into memory. We can reproduce the above plot without using MongoDB, like so:

```r
crimes %>% group_by(`LocationDescription`) %>% summarise(Total=n()) %>%
  arrange(desc(Total)) %>% head(10) %>%
  ggplot(aes(x=reorder(`LocationDescription`, Total), y=Total)) +
  geom_bar(stat="identity", color='skyblue', fill='#b35900') +
  geom_text(aes(label = Total), color = "blue") +
  coord_flip() + xlab("Location Description")
```

What if we want to query all records for certain columns only? This helps us load only the columns we want and save memory for our analysis.

```r
query3 = my_collection$find('{}', fields = '{"_id":0, "Latitude":1, "Longitude":1,"Year":1}')
```

We can explore any patterns of domestic crimes. For example, are they common in certain days/hours/months?
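The date strings in this dataset come in "MM/DD/YYYY HH:MM:SS AM/PM" form; the code below parses them with lubridate's mdy_hms(). The same parsing can be sketched in base R, on one sample record, to show what the Weekday and Hour columns will contain:

```r
# parse one sample timestamp in the dataset's "MM/DD/YYYY HH:MM:SS AM/PM" format
d <- as.POSIXct("04/01/2001 06:00:00 PM",
                format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
format(d, "%u")   # day of week (1 = Monday ... 7 = Sunday)
format(d, "%H")   # hour of day, 24-hour clock
```

April 1, 2001 was a Sunday, and 06:00 PM maps to hour 18.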

```r
domestic = my_collection$find('{"Domestic":"true"}', fields = '{"_id":0, "Domestic":1,"Date":1}')
domestic$Date    = mdy_hms(domestic$Date)
domestic$Weekday = weekdays(domestic$Date)
domestic$Hour    = hour(domestic$Date)
domestic$month   = month(domestic$Date, label=TRUE)

WeekdayCounts = as.data.frame(table(domestic$Weekday))
WeekdayCounts$Var1 = factor(WeekdayCounts$Var1, ordered=TRUE,
                            levels=c("Sunday", "Monday", "Tuesday", "Wednesday",
                                     "Thursday", "Friday", "Saturday"))
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) +
  geom_line(aes(group=1), size=2, color="red") +
  xlab("Day of the Week") + ylab("Total Domestic Crimes") +
  ggtitle("Domestic Crimes in the City of Chicago Since 2001") +
  theme(axis.title.x=element_blank(),
        axis.text.y = element_text(color="blue", size=11, angle=0, hjust=1, vjust=0),
        axis.text.x = element_text(color="blue", size=11, angle=0, hjust=.5, vjust=.5),
        axis.title.y = element_text(size=14),
        plot.title = element_text(size=16, color="purple", hjust=0.5))
```

Domestic crimes are more common on weekends than on weekdays. What could be the reason?

We can also see the pattern for each day by hour:

```r
DayHourCounts = as.data.frame(table(domestic$Weekday, domestic$Hour))
DayHourCounts$Hour = as.numeric(as.character(DayHourCounts$Var2))
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) +
  geom_line(aes(group=Var1, color=Var1), size=1.4) +
  ylab("Total Domestic Crimes") +
  ggtitle("Domestic Crimes in the City of Chicago Since 2001") +
  theme(axis.title.x=element_text(size=14),
        axis.text.y = element_text(color="blue", size=11, angle=0, hjust=1, vjust=0),
        axis.text.x = element_text(color="blue", size=11, angle=0, hjust=.5, vjust=.5),
        axis.title.y = element_text(size=14),
        legend.title=element_blank(),
        plot.title=element_text(size=16, color="purple", hjust=0.5))
```

The crimes peak mainly around midnight. We can also use one color for weekdays and another for weekends, as shown below.

```r
DayHourCounts$Type = ifelse((DayHourCounts$Var1 == "Sunday") | (DayHourCounts$Var1 == "Saturday"),
                            "Weekend", "Weekday")
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) +
  geom_line(aes(group=Var1, color=Type), size=2, alpha=0.5) +
  ylab("Total Domestic Crimes") +
  ggtitle("Domestic Crimes in the City of Chicago Since 2001") +
  theme(axis.title.x=element_text(size=14),
        axis.text.y = element_text(color="blue", size=11, angle=0, hjust=1, vjust=0),
        axis.text.x = element_text(color="blue", size=11, angle=0, hjust=.5, vjust=.5),
        axis.title.y = element_text(size=14),
        legend.title=element_blank(),
        plot.title=element_text(size=16, color="purple", hjust=0.5))
```

The difference between weekends and weekdays is clearer in this figure than in the previous plot. We can also see the above pattern in a heatmap.

```r
DayHourCounts$Var1 = factor(DayHourCounts$Var1, ordered=TRUE,
                            levels=c("Monday", "Tuesday", "Wednesday", "Thursday",
                                     "Friday", "Saturday", "Sunday"))
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) +
  geom_tile(aes(fill = Freq)) +
  scale_fill_gradient(name="Total MV Thefts", low="white", high="red") +
  ggtitle("Domestic Crimes in the City of Chicago Since 2001") +
  ylab("") +
  theme(axis.title.x=element_text(size=14),
        axis.text.y = element_text(size=13),
        axis.text.x = element_text(size=13),
        axis.title.y = element_blank(),
        legend.title=element_blank(),
        plot.title=element_text(size=16, color="purple", hjust=0.5))
```

From the heatmap, we can see more crimes over weekends and at night.

```r
monthCounts = as.data.frame(table(domestic$month))
ggplot(monthCounts, aes(x=Var1, y=Freq)) +
  geom_line(aes(group=1), size=2, color="red") +
  xlab("Month") + ylab("Total Domestic Crimes") +
  ggtitle("Domestic Crimes in the City of Chicago Since 2001") +
  theme(axis.title.x=element_blank(),
        axis.text.y = element_text(color="blue", size=11, angle=0, hjust=1, vjust=0),
        axis.text.x = element_text(color="blue", size=11, angle=0, hjust=.5, vjust=.5),
        axis.title.y = element_text(size=14),
        plot.title=element_text(size=16, color="purple", hjust=0.5))
```

Is domestic crime associated with temperature? Domestic crimes increase considerably during the warm months.

Now, let’s see the pattern of other crime types. Since there are 35 primary types, we cannot see all of them in this post. Let’s focus on four of the most common ones.

```r
crimes = my_collection$find('{}', fields = '{"_id":0, "PrimaryType":1,"Year":1}')
crimes %>% group_by(PrimaryType) %>% summarize(Count=n()) %>%
  arrange(desc(Count)) %>% head(4)
```

```
Imported 6261148 records. Simplifying into dataframe...
      PrimaryType   Count
            THEFT 1301434
          BATTERY 1142377
  CRIMINAL DAMAGE  720143
        NARCOTICS  687790
```

As shown in the table above, the most common crime type is theft, followed by battery. Criminal damage is the third most common crime type in the city of Chicago, while narcotics is the fourth.

Now, let’s generate plots by day and hour.

```r
four_most_common = crimes %>% group_by(PrimaryType) %>% summarize(Count=n()) %>%
  arrange(desc(Count)) %>% head(4)
four_most_common = four_most_common$PrimaryType

crimes = my_collection$find('{}', fields = '{"_id":0, "PrimaryType":1,"Date":1}')
crimes = filter(crimes, PrimaryType %in% four_most_common)
crimes$Date    = mdy_hms(crimes$Date)
crimes$Weekday = weekdays(crimes$Date)
crimes$Hour    = hour(crimes$Date)
crimes$month   = month(crimes$Date, label = TRUE)

g = function(data){
  WeekdayCounts = as.data.frame(table(data$Weekday))
  WeekdayCounts$Var1 = factor(WeekdayCounts$Var1, ordered=TRUE,
                              levels=c("Sunday", "Monday", "Tuesday", "Wednesday",
                                       "Thursday", "Friday", "Saturday"))
  ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) +
    geom_line(aes(group=1), size=2, color="red") +
    xlab("Day of the Week") +
    theme(axis.title.x=element_blank(),
          axis.text.y = element_text(color="blue", size=10, angle=0, hjust=1, vjust=0),
          axis.text.x = element_text(color="blue", size=10, angle=0, hjust=.5, vjust=.5),
          axis.title.y = element_text(size=11),
          plot.title=element_text(size=12, color="purple", hjust=0.5))
}
g1 = g(filter(crimes, PrimaryType=="THEFT")) + ggtitle("Theft") + ylab("Total Count")
g2 = g(filter(crimes, PrimaryType=="BATTERY")) + ggtitle("BATTERY") + ylab("Total Count")
g3 = g(filter(crimes, PrimaryType=="CRIMINAL DAMAGE")) + ggtitle("CRIMINAL DAMAGE") + ylab("Total Count")
g4 = g(filter(crimes, PrimaryType=="NARCOTICS")) + ggtitle("NARCOTICS") + ylab("Total Count")
grid.arrange(g1, g2, g3, g4, ncol=2)
```

From the plots above, we see that theft is most common on Friday. Battery and criminal damage, on the other hand, are highest on the weekend. We also observe that narcotics offenses decrease over the weekend.

We can also see the pattern of the above four crime types by hour:

```r
g = function(data){
  DayHourCounts = as.data.frame(table(data$Weekday, data$Hour))
  DayHourCounts$Hour = as.numeric(as.character(DayHourCounts$Var2))
  ggplot(DayHourCounts, aes(x=Hour, y=Freq)) +
    geom_line(aes(group=Var1, color=Var1), size=1.4) +
    theme(axis.title.x=element_text(size=14),
          axis.text.y = element_text(color="blue", size=11, angle=0, hjust=1, vjust=0),
          axis.text.x = element_text(color="blue", size=11, angle=0, hjust=.5, vjust=.5),
          axis.title.y = element_text(size=14),
          legend.title=element_blank(),
          plot.title=element_text(size=16, color="purple", hjust=0.5))
}
g1 = g(filter(crimes, PrimaryType=="THEFT")) + ggtitle("Theft") + ylab("Total Count")
g2 = g(filter(crimes, PrimaryType=="BATTERY")) + ggtitle("BATTERY") + ylab("Total Count")
g3 = g(filter(crimes, PrimaryType=="CRIMINAL DAMAGE")) + ggtitle("CRIMINAL DAMAGE") + ylab("Total Count")
g4 = g(filter(crimes, PrimaryType=="NARCOTICS")) + ggtitle("NARCOTICS") + ylab("Total Count")
grid.arrange(g1, g2, g3, g4, ncol=2)
```

```r
g = function(data){
  monthCounts = as.data.frame(table(data$month))
  ggplot(monthCounts, aes(x=Var1, y=Freq)) +
    geom_line(aes(group=1), size=2, color="red") +
    xlab("Month") +
    theme(axis.title.x=element_blank(),
          axis.text.y = element_text(color="blue", size=11, angle=0, hjust=1, vjust=0),
          axis.text.x = element_text(color="blue", size=11, angle=0, hjust=.5, vjust=.5),
          axis.title.y = element_text(size=14),
          plot.title=element_text(size=16, color="purple", hjust=0.5))
}
g1 = g(filter(crimes, PrimaryType=="THEFT")) + ggtitle("Theft") + ylab("Total Count")
g2 = g(filter(crimes, PrimaryType=="BATTERY")) + ggtitle("BATTERY") + ylab("Total Count")
g3 = g(filter(crimes, PrimaryType=="CRIMINAL DAMAGE")) + ggtitle("CRIMINAL DAMAGE") + ylab("Total Count")
g4 = g(filter(crimes, PrimaryType=="NARCOTICS")) + ggtitle("NARCOTICS") + ylab("Total Count")
grid.arrange(g1, g2, g3, g4, ncol=2)
```

Except for narcotics, all of them increase in the warmer months. Does this have any association with temperature?

```r
# Load a map of Chicago into R
chicago = get_map(location = "chicago", zoom = 11)
```

Round our latitude and longitude to 2 digits of accuracy, and create a crime counts data frame for each area:

```r
query3 = my_collection$find('{}', fields = '{"_id":0, "Latitude":1, "Longitude":1,"Year":1}')
LatLonCounts = as.data.frame(table(round(query3$Longitude, 2), round(query3$Latitude, 2)))
```

Convert our Longitude and Latitude variable to numbers:

```r
LatLonCounts$Long = as.numeric(as.character(LatLonCounts$Var1))
LatLonCounts$Lat  = as.numeric(as.character(LatLonCounts$Var2))
ggmap(chicago) +
  geom_tile(data = LatLonCounts, aes(x = Long, y = Lat, alpha = Freq), fill="red") +
  ggtitle("Crime Distribution") + labs(alpha="Count") +
  theme(plot.title = element_text(hjust=0.5))
```
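A note on the as.numeric(as.character(...)) idiom used above: calling as.numeric() directly on a factor returns the underlying level codes, not the values. A tiny sketch with two hypothetical latitudes:

```r
f <- factor(c("41.97", "41.88"))   # latitudes stored as a factor (as table() returns them)
as.numeric(f)                      # wrong: the level codes (2, 1 here), not latitudes
as.numeric(as.character(f))        # right: 41.97, 41.88
```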

We can also see a map for domestic crimes only:

```r
domestic = my_collection$find('{"Domestic":"true"}',
                              fields = '{"_id":0, "Latitude":1, "Longitude":1,"Year":1}')
LatLonCounts = as.data.frame(table(round(domestic$Longitude, 2), round(domestic$Latitude, 2)))
LatLonCounts$Long = as.numeric(as.character(LatLonCounts$Var1))
LatLonCounts$Lat  = as.numeric(as.character(LatLonCounts$Var2))
ggmap(chicago) +
  geom_tile(data = LatLonCounts, aes(x = Long, y = Lat, alpha = Freq), fill="red") +
  ggtitle("Domestic Crimes") + labs(alpha="Count") +
  theme(plot.title = element_text(hjust=0.5))
```

Let’s see where motor vehicle theft is common:

```r
mtheft = my_collection$find('{"PrimaryType":"MOTOR VEHICLE THEFT"}',
                            fields = '{"_id":0, "Latitude":1, "Longitude":1,"Year":1}')
LatLonCounts = as.data.frame(table(round(mtheft$Longitude, 2), round(mtheft$Latitude, 2)))
LatLonCounts$Long = as.numeric(as.character(LatLonCounts$Var1))
LatLonCounts$Lat  = as.numeric(as.character(LatLonCounts$Var2))
ggmap(chicago) +
  geom_tile(data = LatLonCounts, aes(x = Long, y = Lat, alpha = Freq), fill="red") +
  ggtitle("Motor Vehicle Theft") + labs(alpha="Count") +
  theme(plot.title = element_text(hjust=0.5))
```

Domestic crimes show concentration in two areas, whereas motor vehicle theft is widespread over a large part of the city of Chicago.

In this post, we saw how to use R to analyze data stored in MongoDB, a leading NoSQL database engine in use today. When dealing with large data volumes, using MongoDB can give us a performance boost and make our life easier.


The code below generates streaming data and inserts it to a local PostgreSQL database. The host can also be a remote server.

```r
library(plot3D)
library(RPostgreSQL)

# Open up a database connection
pg  = dbDriver("PostgreSQL")
con = dbConnect(pg, user = 'postgres', password = "postgrestutorial",
                host = "localhost", port = 5432, dbname = "")

## Let's generate latitude and longitude locations in Alexandria, VA
lats    = seq(from = 38.80, to = 38.8148, by = 0.005)
lons    = seq(from = -77.06, to = -77.04, by = 0.005)
latlons = mesh(lons, lats)
longitude = as.vector(latlons$x)
latitude  = as.vector(latlons$y)
sensorID  = paste0("SN", 10001:10015)  # sensor IDs that help to identify trash cans

# Insert data to the database
# All 15 sensors will send data every 2 seconds for 20 seconds
# Generating random data in increasing order: ratio of volume full
i = 0
status = matrix(0, ncol = 15, nrow = 1)
while (i < 10) {             # loop condition garbled in the source; 10 x 2 s = 20 s
  # (the random status-increment line was truncated in the source)
  status[status >= 1] = 1    # cap the fill ratio at 1 (100% full)
  sensordata = data.frame(timestamp = Sys.time(),
                          sensorID = sensorID,
                          longitude = longitude,
                          latitude = latitude,
                          status = status,
                          stringsAsFactors = FALSE)
  dbWriteTable(con, 'trashcan_sensor_data', sensordata,
               row.names = FALSE, append = TRUE)
  i = i + 1
  Sys.sleep(2)
}

# Insert data to the database
# Let's assume two sensors are not sending data
# 13 sensors will send data every 2 seconds for 20 seconds
# (this loop follows the same pattern as above and was equally truncated in the source)

# Insert data to the database
# Now, let's assume five sensors are not sending data
# 10 sensors will send data every 2 seconds for 20 seconds
## Assume the gateway will fail after 20+20+20 seconds and hence all
## sensors will not send data at all
# (same truncated loop pattern as above)
```

The code below is the **server.R** part of the shiny app.

```r
library(shiny)
library(leaflet)
library(plotly)
library(dplyr)
library(ggplot2)

# a function to make our own icons based on the sensor data
trashicons = function(condition) {
  makeIcon(
    iconUrl = paste0("trashcan_", condition, ".png"),
    iconWidth = 40, iconHeight = 45,
    iconAnchorX = 0, iconAnchorY = 0
  )
}

# connect to the PostgreSQL database
# (the connection call was garbled in the source; a typical src_postgres() call is shown)
localdb <- src_postgres(dbname = "", host = "localhost", port = 5432,
                        user = "postgres", password = "postgrestutorial")

# connect to the table within that database and get the data
data_at_start <- tbl(localdb, "trashcan_sensor_data") %>% collect()

# get the most recent row for each sensor
data_at_start = arrange(data_at_start, -row_number())
data_at_start = distinct(data_at_start, sensorID, .keep_all = TRUE)

# If the cans are 95% full, change the icons to warning icon
# If a sensor for any can is not sending data, change the icon to failed icon
# If it has been more than 10 seconds since any sensor sent data, change the icon to failed
data_at_start$condition = ifelse(data_at_start$status > 0.95, "warning", "ok")
data_at_start$condition[is.na(data_at_start$status)] = "failed"
data_at_start$condition[is.null(data_at_start$status)] = "failed"
data_at_start$condition = ifelse((data_at_start$timestamp + 10) < Sys.time(),
                                 "failed", data_at_start$condition)

# function polled for changes by reactivePoll
# (the query was garbled in the source; a max-timestamp check is the usual pattern)
testfunction = function() {
  query = "SELECT MAX(timestamp) AS max FROM trashcan_sensor_data"
  df = tbl(localdb, sql(query)) %>% collect()
  df$max
}

# A function that gets data from the database based on the testfunction above
ReadAllSensorData = function() {
  query = "SELECT * FROM trashcan_sensor_data"
  tbl(localdb, sql(query)) %>% collect(n = Inf)
}

shinyServer(function(input, output, session) {

  # 100: number of milliseconds to wait between calls to testfunction
  sensorData <- reactivePoll(100, session, testfunction, ReadAllSensorData)

  output$leaflet_map <- renderLeaflet({
    leaflet() %>%
      fitBounds(right, bottom, left, top) %>%   # bounds were garbled in the source
      addTiles() %>%
      addMarkers(data = data_at_start, lng = ~longitude, lat = ~latitude,
                 icon = ~trashicons(data_at_start$condition),
                 label = ~as.character(sensorID),
                 labelOptions = labelOptions(textOnly = TRUE, noHide = FALSE))
  })

  last_data = reactive({
    # every time there is new data, get the last row for each sensorID
    # If the cans are 95% full, change the icons to warning icon
    # If a sensor for any can is not sending data, change the icon to failed icon
    # If it has been more than 10 seconds since any sensor sent data, change the icon to failed
    dat = sensorData()
    dat = arrange(dat, -row_number())
    dat = distinct(dat, sensorID, .keep_all = TRUE)
    dat$condition = ifelse(dat$status > 0.95, "warning", "ok")
    dat$condition[is.na(dat$status)] = "failed"
    dat$condition[is.null(dat$status)] = "failed"
    dat$condition = ifelse((dat$timestamp + 10) < Sys.time(), "failed", dat$condition)
    dat
  })

  # return FALSE if no data came in the last 10 seconds
  isdata_still_coming = reactive({
    (max(sensorData()$timestamp) + 10) > Sys.time()
  })

  condition_observer = reactiveValues(condition = data_at_start$condition)

  # use leafletProxy to manage the dynamic icons: change icon color based on the data
  observe({
    dat = last_data()
    if (isdata_still_coming() == FALSE) {
      leafletProxy("leaflet_map", data = last_data()) %>%
        clearMarkers() %>%
        addMarkers(data = last_data(), lng = ~longitude, lat = ~latitude,
                   icon = ~trashicons("failed"),
                   label = ~as.character(sensorID),
                   labelOptions = labelOptions(textOnly = TRUE))
      condition_observer$condition = "failed"
    } else if (any(condition_observer$condition != dat$condition)) {
      leafletProxy("leaflet_map", data = dat) %>%
        clearMarkers() %>%
        addMarkers(data = dat, lng = ~longitude, lat = ~latitude,
                   icon = ~trashicons(dat$condition),
                   label = ~as.character(sensorID),
                   labelOptions = labelOptions(textOnly = TRUE))
      condition_observer$condition = dat$condition
    }
  })

  # Time series plot: helps us to verify that the icons are changing
  # based on the conditions we provided
  output$plotly_timeseries <- renderPlotly({
    dat = sensorData()
    z = ggplot(dat, aes(x = timestamp, y = status*100, color = sensorID)) +
      geom_line() +
      geom_hline(yintercept = 95, linetype = "dotted", color = "red") +
      ylim(0, 110) + xlab("") + ylab("Percent Full") +
      ggtitle("Trash Can Status") +
      theme(axis.title.y = element_text(colour = "blue", size = 14),
            axis.text = element_text(colour = "darkred", size = 12),
            plot.title = element_text(colour = "darkgreen", size = 16, hjust = 0.5))
    ggplotly(z)
  })
})
```

The code below is the **ui.R** part of the shiny app.

```r
library(shiny)
library(leaflet)
library(plotly)

shinyUI(fluidPage(
  tags$h3("Using PostgreSQL and shiny with a dynamic leaflet map: monitoring trash cans",
          style = "text-align:center;color:#990099"),
  column(width = 6,
         leafletOutput("leaflet_map", height = "680px")),
  column(width = 6,
         plotlyOutput("plotly_timeseries", height = "680px"))
))
```

Code is on Github


In the first part of this exercise, we will build a logistic regression model to predict whether a student gets admitted into a university. Suppose that you are the administrator of a university department and you want to determine each applicant’s chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant’s scores on two exams and the admissions decision. Our task is to build a classification model that estimates an applicant’s probability of admission based on the scores from those two exams.

```r
library(ggplot2)
library(data.table)
library(magrittr)
```

The first two columns contain the exam scores and the third column contains the label.

```r
ex2data1 <- fread("ex2data1.txt", col.names = c("Exam1", "Exam2", "Label"))
head(ex2data1)
#>    Exam1    Exam2 Label
#> 34.62366 78.02469     0
#> 30.28671 43.89500     0
#> 35.84741 72.90220     0
#> 60.18260 86.30855     1
#> 79.03274 75.34438     1
#> 45.08328 56.31637     0
```

Before starting to implement any learning algorithm, it is always good to visualize the data if possible.

```r
cols = c("darkorange", "blue")  # the original palette values were lost in extraction; any two colors work
ex2data1 %>%
  ggplot(aes(x = Exam1, y = Exam2, color = factor(Label))) +
  geom_point(size = 4, shape = 19, alpha = 0.6) +
  scale_colour_manual(values = cols, labels = c("Not admitted", "Admitted"),
                      name = "Admission Status")
```

This is the formula:

$$
g(z) = \frac{1}{1+e^{-z}}
$$

Let’s code the sigmoid function so that we can call it in the rest of our programs.

```r
my_sigmoid = function(z){
  1 / (1 + exp(-z))
}
```

For large positive values of z, the sigmoid should be close to 1, while for large negative values it should be close to 0. Evaluating the sigmoid at 0 should give exactly 0.5.

Let’s check!

```r
my_sigmoid(0)
#> 0.5
my_sigmoid(10^10)
#> 1
my_sigmoid(-10^10)
#> 0
```

We can visualize the sigmoid function graphically:

```r
x = seq(-10, 10, by = 0.1)
y = my_sigmoid(x)
plot(x, y, type = "l", col = "red", xlab = "", ylab = "",
     main = "Sigmoid function (Logistic curve)")
abline(v = 0, lty = 2, col = "gray70")
abline(h = 0.5, lty = 2, col = "gray70")
```

This is the formula:

$$
h_{\theta}(x) = g(\theta^{T}x) = \frac{1}{1+e^{-\theta^{T}x}}
$$

Add ones for the intercept term:

```r
y = ex2data1$Label
X = cbind(1, ex2data1$Exam1, ex2data1$Exam2)
head(X)
#> 1 34.62366 78.02469
#> 1 30.28671 43.89500
#> 1 35.84741 72.90220
#> 1 60.18260 86.30855
#> 1 79.03274 75.34438
#> 1 45.08328 56.31637
```

Initialize fitting parameters:

```r
initial_theta = matrix(rep(0, ncol(X)))
```

The function below calculates cost.

```r
my_cost = function(theta, X, y){
  h_theta = my_sigmoid(X %*% theta)
  cost = 1/nrow(X) * sum(-y * log(h_theta) - (1 - y) * log(1 - h_theta))
  return(cost)
}
```

What is the cost for the initial theta parameters, which are all zeros?

```r
round(my_cost(initial_theta, X, y), 3)
#> 0.693
```

The function below calculates gradient.

```r
my_gradient = function(theta, X, y){
  h_theta = my_sigmoid(X %*% theta)
  ## OPTION 1 -- looping
  # gradient = rep(0, ncol(X))
  # for(j in 1:ncol(X)){
  #   for(i in 1:nrow(X)){
  #     gradient[j] = gradient[j] + 1/nrow(X) * (my_sigmoid(X[i,] %*% theta) - y[i]) * X[i,j]
  #   }
  # }
  ## OPTION 2 -- more succinct
  gradient = 1/nrow(X) * (my_sigmoid(t(theta) %*% t(X)) - t(y)) %*% X
  return(gradient)
}
```
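A useful sanity check, not part of the original exercise, is to compare the analytic gradient against a finite-difference approximation of the cost. The sketch below redefines the sigmoid, cost, and (vectorized) gradient so it runs on its own:

```r
# Self-contained finite-difference check for the logistic-regression gradient.
my_sigmoid = function(z) 1 / (1 + exp(-z))

my_cost = function(theta, X, y){
  h = my_sigmoid(X %*% theta)
  sum(-y * log(h) - (1 - y) * log(1 - h)) / nrow(X)
}

# analytic gradient: (1/m) * X' * (h - y)
my_gradient = function(theta, X, y){
  as.vector((1 / nrow(X)) * t(X) %*% (my_sigmoid(X %*% theta) - y))
}

# central-difference approximation of the gradient of f at theta
numeric_gradient = function(f, theta, eps = 1e-6){
  sapply(seq_along(theta), function(i){
    e = numeric(length(theta)); e[i] = eps
    (f(theta + e) - f(theta - e)) / (2 * eps)
  })
}
```

If the two agree to several decimal places on a small toy problem, the analytic gradient is almost certainly correct.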

The gradient for the initial theta parameters, which are all zeros, is shown below

```r
round(my_gradient(initial_theta, X, y), 3)
#> -0.1 -12.009 -11.263
```

We can use gradient descent to get the optimal theta values, but using an optimization library converges more quickly. So, let’s use `optim`, R’s general-purpose optimization function, to get the required theta values and the associated cost.

```r
optimized = optim(par = initial_theta, X = X, y = y, fn = my_cost, gr = my_gradient)
names(optimized)
#> 'par' 'value' 'counts' 'convergence' 'message'
```

```r
cat("The optimized theta values are: ")
round(optimized$par, 3)
#> The optimized theta values are: -25.165 0.206 0.202
```

```r
cat('The cost at the final theta values is: ')
round(optimized$value, 3)
#> The cost at the final theta values is: 0.203
```

Now, let’s plot the decision boundary. Only 2 points are required to define a line, so let’s choose two endpoints.

```r
plot_x = c(min(ex2data1$Exam1) - 2, max(ex2data1$Exam1) + 2)
# Calculate the decision boundary line
plot_y = (-1/optimized$par[3]) * (optimized$par[2] * plot_x + optimized$par[1])
# Calculate slope and intercept
slope = (plot_y[2] - plot_y[1]) / (plot_x[2] - plot_x[1])
intercept = plot_y[2] - slope * plot_x[2]
cols = c("darkorange", "blue")  # the original palette values were lost in extraction; any two colors work
ex2data1 %>%
  ggplot(aes(x = Exam1, y = Exam2, color = factor(Label))) +
  geom_point(size = 4, shape = 19, alpha = 0.6) +
  scale_colour_manual(values = cols, labels = c("Not admitted", "Admitted"),
                      name = "Admission Status") +
  geom_abline(intercept = intercept, slope = slope, col = "purple", show.legend = TRUE)
```

After learning the parameters, you can use the model to predict whether a particular student will be admitted. For example, for a student with an Exam 1 score of 45 and an Exam 2 score of 85, the probability of admission is shown below.

```r
round(my_sigmoid(c(1, 45, 85) %*% optimized$par), 3)
#> 0.776
```

Now, let’s calculate the model accuracy. Let’s use a threshold of 0.5.

```r
ex2data1$fitted_result = my_sigmoid(X %*% optimized$par)
ex2data1$fitted_result_label = ifelse(ex2data1$fitted_result >= 0.5, 1, 0)
head(ex2data1)
#>    Exam1    Exam2 Label fitted_result fitted_result_label
#> 34.62366 78.02469     0  9.102720e-02                   0
#> 30.28671 43.89500     0  4.220398e-05                   0
#> 35.84741 72.90220     0  4.389943e-02                   0
#> 60.18260 86.30855     1  9.904316e-01                   1
#> 79.03274 75.34438     1  9.982001e-01                   1
#> 45.08328 56.31637     0  1.079133e-02                   0
```

```r
accuracy = sum(ex2data1$Label == ex2data1$fitted_result_label) / nrow(ex2data1)
accuracy
#> 0.89
```
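Accuracy alone hides which kinds of mistakes the model makes. A confusion table breaks predictions down by class; the vectors below are stand-ins for illustration (with the fitted model you would pass `ex2data1$Label` and `ex2data1$fitted_result_label`):

```r
# Stand-in label vectors; replace with the actual and fitted labels.
actual    = c(0, 0, 1, 1, 1, 0)
predicted = c(0, 1, 1, 1, 0, 0)

# Rows are the true class, columns the predicted class.
conf = table(Actual = actual, Predicted = predicted)
conf

# Accuracy is the trace of the table over its total.
accuracy = sum(diag(conf)) / sum(conf)
accuracy
```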

In this part of the exercise, we will implement regularized logistic regression to predict whether microchips from a fabrication plant pass quality assurance (QA). During QA, each microchip goes through various tests to ensure it is functioning correctly. Suppose you are the product manager of the factory and you have the test results for some microchips on two different tests. From these two tests, you would like to determine whether the microchips should be accepted or rejected.

```r
ex2data2 <- fread("ex2data2.txt", col.names = c("Test1", "Test2", "Label"))
head(ex2data2)
#>     Test1   Test2 Label
#>  0.051267 0.69956     1
#> -0.092742 0.68494     1
#> -0.213710 0.69225     1
#> -0.375000 0.50219     1
#> -0.513250 0.46564     1
#> -0.524770 0.20980     1
```

```r
cols = c("darkorange", "blue")  # the original palette values were lost in extraction; any two colors work
ex2data2 %>%
  ggplot(aes(x = Test1, y = Test2, color = factor(Label))) +
  geom_point(size = 4, shape = 19, alpha = 0.6) +
  scale_colour_manual(values = cols, labels = c("Failed", "Passed"), name = "Test Result")
```

The above figure shows that our dataset cannot be separated into positive and negative examples by a straight line through the plot. Therefore, a straightforward application of logistic regression will not perform well on this dataset, since logistic regression will only be able to find a linear decision boundary.

One way to fit the data better is to create more features from each data point. Let’s map the features into all polynomial terms of x1 and x2 up to the sixth power.

```r
new_data = c()
for(i in 1:6){
  for(j in 0:i){
    temp = (ex2data2$Test1)^i + (ex2data2$Test2)^(i - j)
    new_data = cbind(new_data, temp)
  }
}
colnames(new_data) = paste0("V", 1:ncol(new_data))
head(new_data)
#>       V1       V2        V3        V4       V5 ...        V25       V26      V27
#> 0.750827 1.051267 0.4920125 0.7021883 1.002628 ... 0.48938421 0.6995600 1.000000
#> 0.592198 0.907258 0.4777439 0.6935411 1.008601 ... 0.46914344 0.6849406 1.000001
#> 0.478540 0.786290 0.5248820 0.7379220 1.045672 ... 0.47930533 0.6923453 1.000095
#> 0.127190 0.625000 0.3928198 0.6428150 1.140625 ... 0.25497571 0.5049709 1.002781
#> (remaining rows and middle columns omitted for width)
```

```r
y = ex2data2$Label
X = new_data
```

As a result of this mapping, our vector of two features (the scores on two QA tests) has been transformed into a much higher-dimensional vector (28-dimensional in the original exercise, which includes an intercept column). A logistic regression classifier trained on this higher-dimensional feature vector will have a more complex decision boundary and will appear nonlinear when drawn in our 2-dimensional plot. While the feature mapping allows us to build a more expressive classifier, it is also more susceptible to overfitting. In the next parts of the exercise, we will implement regularized logistic regression to fit the data and also see for ourselves how regularization can help combat the overfitting problem.
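For reference, the classic `mapFeature` from the Coursera exercise builds products of powers, x1^(i-j) * x2^j, giving 28 columns including an intercept term. A sketch of that version (names are mine, not from the post):

```r
# Polynomial feature mapping as in the original Coursera exercise:
# for degree 6 this yields 1 intercept column plus 27 product terms.
map_feature = function(x1, x2, degree = 6){
  out = matrix(1, nrow = length(x1), ncol = 1)  # intercept column
  for(i in 1:degree){
    for(j in 0:i){
      out = cbind(out, (x1^(i - j)) * (x2^j))
    }
  }
  out
}
```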

```r
my_cost2 = function(theta, X, y, lambda){
  # regularized cost for logistic regression
  h_theta = my_sigmoid(X %*% theta)
  cost = 1/nrow(X) * sum(-y * log(h_theta) - (1 - y) * log(1 - h_theta)) +
    lambda/(2 * nrow(X)) * sum(theta^2)
  return(cost)
}

my_gradient2 = function(theta, X, y, lambda){
  h_theta = my_sigmoid(X %*% theta)
  ## OPTION 1 -- looping
  gradient = rep(0, ncol(X))
  for(i in 1:nrow(X)){
    # the first parameter is not regularized
    gradient[1] = gradient[1] + 1/nrow(X) * (my_sigmoid(X[i,] %*% theta) - y[i]) * X[i,1]
  }
  for(j in 2:ncol(X)){
    for(i in 1:nrow(X)){
      gradient[j] = gradient[j] + 1/nrow(X) * (my_sigmoid(X[i,] %*% theta) - y[i]) * X[i,j]
    }
    gradient[j] = gradient[j] + lambda/nrow(X) * theta[j]
  }
  ## OPTION 2 -- more succinct
  # gradient = 1/nrow(X) * (my_sigmoid(t(theta) %*% t(X)) - t(y)) %*% X +
  #   (lambda/nrow(X)) * c(0, theta[-1])
  return(gradient)
}
```

Let’s initialize theta:

```r
initial_theta = matrix(rep(0, ncol(X)))
```

What is the cost for the initial theta?

```r
round(my_cost2(initial_theta, X, y, lambda = 10), 3)
#> 0.693
```

Now, since we have the cost function that we want to optimize and the gradient, we can use the optimization function optim to find the optimal theta values.

We have to try various values of lambda and select the best lambda based on cross-validation. But for now, let’s just take lambda=1.

```r
optimized = optim(par = initial_theta, X = X, y = y, lambda = 1,
                  fn = my_cost2, gr = my_gradient2, method = "BFGS")
```

The theta values from the optimization are shown below.

```r
matrix(as.vector(optimized$par), ncol = 9)
#>  0.49507083 -0.05202124 -0.08527443 -0.1575100 -0.035020544 -0.11079198  0.01169745 -0.12174228 -0.12234383
#>  0.05639364 -0.02062364  0.01959276 -0.1129876 -0.003622936 -0.06626957  0.04309506 -0.13996607 -0.01747664
#> -0.15688844 -0.05837426  0.05099037 -0.1398877 -0.092568188 -0.09316975 -0.14349836 -0.09544366  0.01392097
```

```r
ex2data2$fitted_result = my_sigmoid(X %*% optimized$par)
```

Let’s use a threshold of 0.5.

```r
ex2data2$fitted_result_label = ifelse(ex2data2$fitted_result >= 0.5, 1, 0)
head(ex2data2, 10)
#>     Test1     Test2 Label fitted_result fitted_result_label
#>  0.051267  0.699560     1     0.4765027                   0
#> -0.092742  0.684940     1     0.4629791                   0
#> -0.213710  0.692250     1     0.4410811                   0
#> -0.375000  0.502190     1     0.4701446                   0
#> -0.513250  0.465640     1     0.4459059                   0
#> -0.524770  0.209800     1     0.4554844                   0
#> -0.398040  0.034357     1     0.4730207                   0
#> -0.305880 -0.192250     1     0.4618593                   0
#>  0.016705 -0.404240     1     0.4735035                   0
#>  0.131910 -0.513890     1     0.4640032                   0
```

Now, we can evaluate the fit by calculating various metrics such as F1 score, precision, and recall. Let’s just look at accuracy here.

```r
accuracy = sum(ex2data2$Label == ex2data2$fitted_result_label) / nrow(ex2data2)
cat("The accuracy is: ")
round(accuracy, 3)
#> The accuracy is: 0.653
```
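For completeness, precision, recall, and F1 can be computed from the same pair of label vectors. A small illustrative helper (the function name is mine, not from the post):

```r
# Precision, recall, and F1 for binary labels coded as 0/1.
precision_recall_f1 = function(actual, predicted){
  tp = sum(actual == 1 & predicted == 1)  # true positives
  fp = sum(actual == 0 & predicted == 1)  # false positives
  fn = sum(actual == 1 & predicted == 0)  # false negatives
  precision = tp / (tp + fp)
  recall    = tp / (tp + fn)
  f1 = 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}
```

With the fitted model you would call `precision_recall_f1(ex2data2$Label, ex2data2$fitted_result_label)`.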

Logistic regression is a classification machine learning technique. In this blog post, we saw how to implement logistic regression with and without regularization. In the first part, we used it to predict if a student will be admitted to a university, and in the second part, we used it to predict whether microchips from a fabrication plant pass quality assurance. We used R’s `optim` function in lieu of gradient descent to get the optimal values of the coefficients. The data sets are from the Coursera machine learning course offered by Andrew Ng. The course is offered with Matlab/Octave; I am doing the exercises in that course with R. You can get the code from this Github repository.


In this part of this exercise, we will implement linear regression with one variable to predict profits for a food truck. Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities and you have data for profits and populations from cities. We would like to use this data to help us select which city to expand to next.

The file ex1data1.txt contains the dataset for our linear regression problem. The first column is the population of a city and the second column is the profit of a food truck in that city. A negative value for profit indicates a loss. The first column refers to the population size in 10,000s and the second column refers to the profit in $10,000s.

```r
library(ggplot2)
library(data.table)
library(magrittr)
library(caret)
library(fields)
library(plot3D)
```

**Load the data and display the first 6 observations**

```r
ex1data1 <- fread("ex1data1.txt", col.names = c("population", "profit"))
head(ex1data1)
#>    population  profit
#> 1:     6.1101 17.5920
#> 2:     5.5277  9.1302
#> 3:     8.5186 13.6620
#> 4:     7.0032 11.8540
#> 5:     5.8598  6.8233
#> 6:     8.3829 11.8860
```

Before starting on any task, it is often useful to understand the data by visualizing it. For this dataset, we can use a scatter plot to visualize the data, since it has only two properties to plot (profit and population). Many other problems that we encounter in real life are multi-dimensional and can’t be plotted on a 2-d plot.

```r
ex1data1 %>% ggplot(aes(x = population, y = profit)) +
  geom_point(color = "blue", size = 4, alpha = 0.5) +
  ylab('Profit in $10,000s') +
  xlab('Population of City in 10,000s') +
  ggtitle('Figure 1: Scatter plot of training data') +
  theme(plot.title = element_text(size = 16, colour = "red"))
```

In this part, we will fit linear regression parameters θ to our dataset using gradient descent.

To implement gradient descent, we need to calculate the cost, which is given by:

$$
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^{2}
$$

where the hypothesis is $h_{\theta}(x) = \theta^{T}x$ and $m$ is the number of training examples.

As we perform gradient descent to minimize the cost function J(θ), it is helpful to monitor the convergence by computing the cost. In this section, we will implement a function to calculate J(θ) so we can check the convergence of our gradient descent implementation.

But remember, before we go any further we need to add the intercept term, a column of ones, to X.

```r
X = cbind(1, ex1data1$population)
y = ex1data1$profit
head(X)
#>      [,1]   [,2]
#> [1,]    1 6.1101
#> [2,]    1 5.5277
#> [3,]    1 8.5186
#> [4,]    1 7.0032
#> [5,]    1 5.8598
#> [6,]    1 8.3829
```

The function below calculates cost based on the equation given above.

```r
computeCost = function(X, y, theta){
  z = ((X %*% theta) - y)^2
  return(sum(z) / (2 * nrow(X)))
}
```

Now, we can calculate the initial cost by initializing the parameters to 0.

```r
theta = matrix(rep(0, ncol(X)))
round(computeCost(X, y, theta), 2)
#> 32.07
```

Next, we will implement gradient descent by calling the *computeCost* function above. A good way to verify that gradient descent is working correctly is to look at the value of J(θ) and check that it is decreasing with each step. Our value of J(θ) should never increase, and should converge to a steady value by the end of the algorithm.

The gradient descent function below returns the cost in every iteration and the optimal parameters for the number of iterations and learning rate alpha specified.

```r
gradientDescent = function(X, y, theta, alpha, iters){
  gd = list()
  cost = rep(0, iters)
  for(k in 1:iters){
    z = rep(0, ncol(X))
    for(i in 1:ncol(X)){
      for(j in 1:nrow(X)){
        z[i] = z[i] + (((X[j,] %*% theta) - y[j]) * X[j,i])
      }
    }
    theta = theta - ((alpha / nrow(X)) * z)
    cost[k] = computeCost(X, y, theta)
  }
  gd$theta = theta
  gd$cost = cost
  gd
}
```
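The double loop above can be collapsed into a single matrix product per iteration. This vectorized variant is a sketch of the same update rule (it repeats `computeCost` so it runs standalone; the function name is mine):

```r
# Cost function, repeated here so the sketch is self-contained.
computeCost = function(X, y, theta){
  sum(((X %*% theta) - y)^2) / (2 * nrow(X))
}

# Vectorized batch gradient descent: the inner loops become t(X) %*% residuals.
gradientDescentVec = function(X, y, theta, alpha, iters){
  cost = rep(0, iters)
  for(k in 1:iters){
    theta = theta - (alpha / nrow(X)) * (t(X) %*% (X %*% theta - y))
    cost[k] = computeCost(X, y, theta)
  }
  list(theta = theta, cost = cost)
}
```

On the same inputs it returns the same parameters as the looped version, just much faster, since R is slow at explicit element-wise loops.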

Now, let’s use the *gradientDescent* function to find the parameters, making sure that our cost never increases. Let’s use 1500 iterations and a learning rate alpha of 0.01 for now. Later, we will see the effect of these values in our application.

```r
iterations = 1500
alpha = 0.01
theta = matrix(rep(0, ncol(X)))
gradientDescent_results = gradientDescent(X, y, theta, alpha, iterations)
theta = gradientDescent_results$theta
theta
#>           [,1]
#> [1,] -3.630291
#> [2,]  1.166362
```

We expect the cost to be decreasing, or at the very least never increasing, if our implementation is correct. Let’s plot the cost as a function of the number of iterations and make sure our implementation makes sense.

```r
data.frame(Cost = gradientDescent_results$cost, Iterations = 1:iterations) %>%
  ggplot(aes(x = Iterations, y = Cost)) + geom_line(color = "blue") +
  ggtitle("Cost as a function of number of iterations") +
  theme(plot.title = element_text(size = 16, colour = "red"))
```

Here is the plot:

As the plot above shows, the cost decreases with the number of iterations and gets very close to convergence within 1500 iterations.

Now, since we know the parameters (slope and intercept), we can plot the linear fit on the scatter plot.

```r
ex1data1 %>% ggplot(aes(x = population, y = profit)) +
  geom_point(color = "blue", size = 3, alpha = 0.5) +
  ylab('Profit in $10,000s') +
  xlab('Population of City in 10,000s') +
  ggtitle('Figure 1: Scatter plot of training data') +
  geom_abline(intercept = theta[1], slope = theta[2], col = "red", show.legend = TRUE) +
  theme(plot.title = element_text(size = 16, colour = "red")) +
  annotate("text", x = 12, y = 20,
           label = paste0("Profit = ", round(theta[1], 4), " + ", round(theta[2], 4), " * Population"))
```

To understand the cost function J(θ) better, let’s now plot the cost over a 2-dimensional grid of θ0 and θ1 values. The global minimum is the optimal point for θ0 and θ1, and each step of gradient descent moves closer to this point.

```r
Intercept = seq(from = -10, to = 10, length = 100)
Slope = seq(from = -1, to = 4, length = 100)
# initialize cost values to a matrix of 0's
Cost = matrix(0, length(Intercept), length(Slope))
for(i in 1:length(Intercept)){
  for(j in 1:length(Slope)){
    t = c(Intercept[i], Slope[j])
    Cost[i,j] = computeCost(X, y, t)
  }
}
persp3D(Intercept, Slope, Cost, theta = -45, phi = 25, expand = 0.75,
        lighting = TRUE, ticktype = "detailed",
        xlab = "Intercept", ylab = "Slope", zlab = "", axes = TRUE, main = "Surface")
image.plot(Intercept, Slope, Cost, main = "Contour, showing minimum")
contour(Intercept, Slope, Cost, add = TRUE, n = 30, labels = '')
points(theta[1], theta[2], col = 'red', pch = 4, lwd = 6)
```

Since linear regression has a closed-form solution, we can also solve it analytically, using what is called the normal equation, given by the formula below. With it, we do not need to iterate or choose a learning rate. However, we do need to compute the inverse of a matrix, which makes it slow when the number of records is very large. Gradient descent, by contrast, applies to other machine learning techniques as well, and is the more appropriate choice even for linear regression when the number of observations is very large.

$$
\theta = (X^{T}X)^{-1}X^{T}y
$$

```r
theta2 = solve(t(X) %*% X) %*% t(X) %*% y
theta2
#>           [,1]
#> [1,] -3.895781
#> [2,]  1.193034
```

There is a very small difference between the parameters we got from the normal equation and those from gradient descent. Let’s increase the number of iterations and see if they get closer to each other. Here, I increased the number of iterations from 1500 to 15000.

```r
iterations = 15000
alpha = 0.01
theta = matrix(rep(0, ncol(X)))
gradientDescent_results = gradientDescent(X, y, theta, alpha, iterations)
theta = gradientDescent_results$theta
theta
#>           [,1]
#> [1,] -3.895781
#> [2,]  1.193034
```

As you can see, now the results from normal equation and gradient descent are the same.

By the way, we can also use packages developed by experts in the field to perform our machine learning tasks. There are many machine learning packages in R for different types of machine learning tasks. To verify that we get the same results, let’s use the caret package, which is among the most commonly used machine learning packages in R.

```r
my_lm <- train(profit ~ population, data = ex1data1, method = "lm")
my_lm$finalModel$coefficients
#> (Intercept) -3.89578087831187
#> population   1.1930336441896
```

In this part of the exercise, we will implement linear regression with multiple variables to predict the prices of houses. Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recently sold houses and build a model of housing prices. The file ex1data2.txt contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.

```r
ex1data2 <- fread("ex1data2.txt", col.names = c("size", "bedrooms", "price"))
head(ex1data2)
#>    size bedrooms  price
#> 1: 2104        3 399900
#> 2: 1600        3 329900
#> 3: 2400        3 369000
#> 4: 1416        2 232000
#> 5: 3000        4 539900
#> 6: 1985        4 299900
```

House sizes are about 1000 times the number of bedrooms. When features differ by orders of magnitude, first performing feature scaling can make gradient descent converge much more quickly.

```r
ex1data2 = as.data.frame(ex1data2)
for(i in 1:(ncol(ex1data2) - 1)){
  ex1data2[,i] = (ex1data2[,i] - mean(ex1data2[,i])) / sd(ex1data2[,i])
}
```
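The normalization loop above is equivalent to R's built-in `scale()`, which centers each column and divides by its standard deviation. A minimal sketch on a few stand-in rows:

```r
# Stand-in rows in the shape of ex1data2's feature columns.
ex1data2_demo = data.frame(size = c(2104, 1600, 2400),
                           bedrooms = c(3, 3, 4))

# scale() returns a matrix of z-scores: mean 0, sd 1 per column.
scaled_demo = as.data.frame(scale(ex1data2_demo))
```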

Previously, we implemented gradient descent on a univariate regression problem. The only difference now is that there is one more feature in the matrix X. The hypothesis function and the batch gradient descent update rule remain unchanged.

```r
X = cbind(1, ex1data2$size, ex1data2$bedrooms)
y = ex1data2$price
head(X)
#>      [,1]        [,2]       [,3]
#> [1,]    1  0.13000987 -0.2236752
#> [2,]    1 -0.50418984 -0.2236752
#> [3,]    1  0.50247636 -0.2236752
#> [4,]    1 -0.73572306 -1.5377669
#> [5,]    1  1.25747602  1.0904165
#> [6,]    1 -0.01973173  1.0904165

iterations = 6000
alpha = 0.01
theta = matrix(rep(0, ncol(X)))
gradientDescent_results = gradientDescent(X, y, theta, alpha, iterations)
theta = gradientDescent_results$theta
theta
#>            [,1]
#> [1,] 340412.660
#> [2,] 110631.050
#> [3,]  -6649.474
```

```r
theta2 = solve(t(X) %*% X) %*% t(X) %*% y
theta2
#>            [,1]
#> [1,] 340412.660
#> [2,] 110631.050
#> [3,]  -6649.474
```

```r
ex1data2 <- fread("ex1data2.txt", col.names = c("size", "bedrooms", "price"))
my_lm <- train(price ~ size + bedrooms, data = ex1data2, method = "lm",
               preProcess = c("center", "scale"))
my_lm$finalModel$coefficients
#> (Intercept) 340412.659574468
#> size        110631.050278846
#> bedrooms     -6649.4742708198
```

In this post, we saw how to implement numerical and analytical solutions to linear regression problems using R. We also used `caret`, a popular R machine learning package, to verify our results. The data sets are from the Coursera machine learning course offered by Andrew Ng. The course is offered with Matlab/Octave; I am doing the exercises in that course with R. You can get the code from this Github repository.


This week’s short blog post is on visualizing streaming data using shiny. The dashboard updates automatically to incorporate newly added data. At the end of this post you will find a video tutorial on how to send text and email alerts, based on streaming sensor data, when there is an abnormal reading or when a sensor fails.

First, the script below simulates streaming data by repeatedly sampling the diamonds dataset and writing each sample to a time-stamped CSV file.

```r
library(ggplot2)
library(dplyr)

data("diamonds")
while(TRUE){
  temp = sample_frac(diamonds, 0.1)
  write.csv(temp, paste0("sampled", gsub("[^0-9]", "", Sys.time()), ".csv"),
            row.names = FALSE)
  Sys.sleep(10)  # suspend execution of R expressions for 10 seconds
}
```

Then, the following shiny code fetches the newly generated data and creates a shiny dashboard.

**ui.R**

```r
library(shiny)

fluidPage(
  tags$h2("Visualizing Streaming Data with Shiny", style = "color:blue;text-align:center"),
  plotOutput("plot1", height = "600px")
)
```

**server.R**

```r
library(shiny)
library(data.table)
library(ggplot2)
library(gridExtra)
library(readr)

IsThereNewFile = function(){
  # cheap function whose values over time will be tested for equality;
  # inequality indicates that the underlying value has changed and needs to be
  # invalidated and re-read using valueFunc
  filenames <- list.files(pattern = "*.csv", full.names = TRUE)
  length(filenames)
}

ReadAllData = function(){
  # A function that calculates the underlying value
  filenames <- list.files(pattern = "*.csv", full.names = TRUE)
  read_csv(filenames[length(filenames)])
}

function(input, output, session) {

  sampled_data <- reactivePoll(10, session, IsThereNewFile, ReadAllData)
  # 10: number of milliseconds to wait between calls to checkFunc

  output$plot1 <- renderPlot({
    sampled_data = sampled_data()
    g1 = ggplot(sampled_data, aes(depth, fill = cut, colour = cut)) +
      geom_density(alpha = 0.1) + xlim(55, 70) +
      ggtitle("Distribution of Depth by Cut") +
      theme(plot.title = element_text(color = "darkred", size = 18, hjust = 0.5),
            axis.text.y = element_text(color = "blue", size = 12, hjust = 1),
            axis.text.x = element_text(color = "darkred", size = 12, hjust = .5, vjust = .5),
            axis.title.x = element_text(color = "red", size = 14),
            axis.title.y = element_text(size = 14))
    g2 = ggplot(sampled_data, aes(carat, ..count.., fill = cut)) +
      geom_density(position = "stack") + ggtitle("Total Carat by Count") +
      theme(plot.title = element_text(color = "purple", size = 18, hjust = 0.5),
            axis.text.y = element_text(color = "blue", size = 12, hjust = 1),
            axis.text.x = element_text(color = "darkred", size = 12, hjust = .5, vjust = .5),
            axis.title.x = element_text(color = "red", size = 14),
            axis.title.y = element_text(size = 14))
    g3 = ggplot(sampled_data, aes(carat, ..count.., fill = cut)) +
      geom_density(position = "fill") + ggtitle("Conditional Density Estimate") +
      theme(plot.title = element_text(color = "black", size = 18, hjust = 0.5),
            axis.text.y = element_text(color = "blue", size = 12, hjust = 1),
            axis.text.x = element_text(color = "darkred", size = 12, hjust = .5, vjust = .5),
            axis.title.x = element_text(color = "red", size = 14),
            axis.title.y = element_text(size = 14))
    g4 = ggplot(sampled_data, aes(carat, price)) + geom_boxplot() + facet_grid(. ~ cut) +
      ggtitle("Price by Carat for each cut") +
      theme(plot.title = element_text(color = "darkblue", size = 18, hjust = 0.5),
            axis.text.y = element_text(color = "blue", size = 12, hjust = 1),
            axis.text.x = element_text(color = "darkred", size = 12, hjust = .5, vjust = .5),
            axis.title.x = element_text(color = "red", size = 14),
            axis.title.y = element_text(size = 14))
    grid.arrange(g1, g2, g3, g4)
  })
}
```
