<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Modeling with R</title>
    <link>https://modelingwithr.rbind.io/</link>
      <atom:link href="https://modelingwithr.rbind.io/index.xml" rel="self" type="application/rss+xml" />
    <description>Modeling with R</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Sat, 01 Jun 2030 13:00:00 +0000</lastBuildDate>
    <image>
      <url>https://modelingwithr.rbind.io/images/icon_hu0b7a4cb9992c9ac0e91bd28ffd38dd00_9727_512x512_fill_lanczos_center_2.png</url>
      <title>Modeling with R</title>
      <link>https://modelingwithr.rbind.io/</link>
    </image>
    
    <item>
      <title>Example Talk</title>
      <link>https://modelingwithr.rbind.io/talk/example/</link>
      <pubDate>Sat, 01 Jun 2030 13:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/talk/example/</guid>
      <description>&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click on the &lt;strong&gt;Slides&lt;/strong&gt; button above to view the built-in slides feature.
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Slides can be added in a few ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Create&lt;/strong&gt; slides using Academic&amp;rsquo;s 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/managing-content/#create-slides&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;em&gt;Slides&lt;/em&gt;&lt;/a&gt; feature and link using &lt;code&gt;slides&lt;/code&gt; parameter in the front matter of the talk file&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Upload&lt;/strong&gt; an existing slide deck to &lt;code&gt;static/&lt;/code&gt; and link using &lt;code&gt;url_slides&lt;/code&gt; parameter in the front matter of the talk file&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Embed&lt;/strong&gt; your slides (e.g. Google Slides) or presentation video on this page using 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/writing-markdown-latex/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;shortcodes&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Further talk details can easily be added to this page using &lt;em&gt;Markdown&lt;/em&gt; and $\rm \LaTeX$ math code.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Predicting large text data with spark via the R package sparklyr</title>
      <link>https://modelingwithr.rbind.io/sparklyr/text_spark/predicting-large-text-data-with-spark-via-the-r-package-sparklyr/</link>
      <pubDate>Thu, 02 Jul 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/sparklyr/text_spark/predicting-large-text-data-with-spark-via-the-r-package-sparklyr/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#abstract&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Abstract&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#keywords&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Keywords&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tf-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; TF model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tf-idf-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; TF-IDF model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#add-new-features&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Add new features&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#tf-model-1&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7.1&lt;/span&gt; TF model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tf_idf-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7.2&lt;/span&gt; tf_idf model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#n-gram-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8&lt;/span&gt; n-gram model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;9&lt;/span&gt; Conclusion:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;10&lt;/span&gt; References&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-information&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;11&lt;/span&gt; Session information&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
.main-container {
  max-width: none;
  margin-left: 2.5em;
  margin-right: 2.5em;
}
&lt;/style&gt;
&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;abstract&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Abstract&lt;/h1&gt;
&lt;p&gt;Classical programming languages use only a single core, so they can be slow on very large data sets and may even fail to load them. Apache Spark, by contrast, is a very fast distributed system that handles large datasets with ease by deploying all the available machines and cores to build a cluster, so that the computing time of each task performed on the data is drastically reduced: each &lt;strong&gt;worker node&lt;/strong&gt; in the cluster takes charge of a small part of the task in question. Even though the native language of spark is &lt;strong&gt;scala&lt;/strong&gt; (it also supports &lt;strong&gt;java&lt;/strong&gt; and &lt;strong&gt;sql&lt;/strong&gt;), the good news for R users is that they can benefit from spark without having to learn those languages, by making use of the R package &lt;strong&gt;sparklyr&lt;/strong&gt;. In this article we train a random forest model on text data, which in practice is typically a large data set. For illustration purposes, however, and to make things faster, we use a small data set of email messages and constrain ourselves to the &lt;strong&gt;local mode&lt;/strong&gt;, in which spark creates a cluster from the cores available on a single machine. Notice that the same code can be used in the cloud whatever the size of the data, even with billions of data points; only the connection method to spark differs slightly. Since the raw data requires some transformation before being consumed by the model, we apply the well-known method called &lt;strong&gt;tokenization&lt;/strong&gt; to create the model features, then train and evaluate a random forest model on the design matrix after filling it using the &lt;strong&gt;TF&lt;/strong&gt; method. Lastly, we train the same model (a random forest with the same hyperparameter values) using another method, the &lt;strong&gt;TF-IDF&lt;/strong&gt; method (Sparck Jones, 1972).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;keywords&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Keywords&lt;/h1&gt;
&lt;p&gt;Large dataset, R, spark, sparklyr, cluster, tokenization, TF, TF-IDF, random forest model, machine learning.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;R is one of the best programming languages for statistical analysis, providing &lt;strong&gt;data scientists&lt;/strong&gt; with powerful tools that make their work easier and more exciting. However, since the amount of information today is growing exponentially, R and all the classical languages (python, java, etc.) that run on a single machine (a single core node) face great challenges in handling large datasets whose size can, in some cases, even exceed the memory size.
&lt;strong&gt;spark&lt;/strong&gt; and &lt;strong&gt;hadoop&lt;/strong&gt; are two systems designed to overcome these limitations of classical programming languages. Both use a distributed computing model that runs multiple tasks on multiple machines (called &lt;strong&gt;nodes&lt;/strong&gt;, and together called a &lt;strong&gt;cluster&lt;/strong&gt;) at the same time. However, spark has an edge over hadoop thanks to its ability to keep the data in memory, which makes it much faster (Luraschi, 2014).
Spark creates a cluster using either physical machines or virtual machines provided by a &lt;strong&gt;cloud&lt;/strong&gt; provider such as google, amazon, or microsoft (it can also create a cluster from the available cores of a single machine, known as &lt;strong&gt;local mode&lt;/strong&gt;). Its native language is scala, but it also supports sql and java. Thankfully, spark provides high-level APIs in &lt;strong&gt;python&lt;/strong&gt; and &lt;strong&gt;R&lt;/strong&gt;, so R users can use spark as a platform to work with large datasets with familiar code and without having to learn scala, sql, or java. The connection between R and spark is not direct, though; it is established with the help of the &lt;strong&gt;sparklyr&lt;/strong&gt; package, which, like any other R package, has its own functions and supports almost all the functions of the popular &lt;strong&gt;dplyr&lt;/strong&gt; R package.
Most text data are considered large datasets, either because of their size or because of the computing time required to manipulate or model them. That is why, in this paper, we will train a &lt;strong&gt;Random forest model&lt;/strong&gt; with sparklyr to predict whether a text message is spam or ham, using the data set &lt;strong&gt;SMSSpamCollection&lt;/strong&gt; downloaded from the &lt;strong&gt;kaggle&lt;/strong&gt; website. To convert the character features to numeric type we will use two well-known transformations: TF and TF-IDF (Sparck Jones, 1972).
This article is divided into the following sections:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data preparation: we illustrate how we read, clean, and prepare the data to be consumed by the model.&lt;/li&gt;
&lt;li&gt;TF model: we train a random forest model (James et al., 2013) on the term frequency (TF) features.&lt;/li&gt;
&lt;li&gt;TF-IDF model: we train the random forest model on the TF-IDF features.&lt;/li&gt;
&lt;li&gt;Add new features: we create another feature from the data to be used as a new predictor.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;First, we load the R packages &lt;strong&gt;tidyverse&lt;/strong&gt; and &lt;strong&gt;sparklyr&lt;/strong&gt;, and we set up the connection to spark using the following R code.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(sparklyr))
suppressPackageStartupMessages(library(tidyverse))
sc&amp;lt;-spark_connect(master = &amp;quot;local&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Second, we read the data, which has been downloaded and saved in the local R directory (notice that the data does not have column headers), and we display the first rows to get a first glance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;path &amp;lt;- &amp;quot;C://Users/dell/Documents/SMSSpamCollection.txt&amp;quot;
mydata&amp;lt;-spark_read_csv(sc,name=&amp;quot;SMS&amp;quot;,path=path, header=FALSE,delimiter = &amp;quot;\t&amp;quot;,overwrite = TRUE)
knitr::kable(head(mydata,3))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;96%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;V1&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;V2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;ham&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;ham&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Ok lar… Joking wif u oni…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;spam&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&amp;amp;C’s apply 08452810075over18’s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;It will be more practical to replace the default column names V1 and V2 with labels and messages respectively.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;names(mydata)&amp;lt;-c(&amp;quot;labels&amp;quot;,&amp;quot;messages&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can get the dimensions of this data by using the function &lt;strong&gt;sdf_dim&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sdf_dim(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 5574    2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also take a look at some messages by displaying the first three rows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;select(mydata,messages)%&amp;gt;%
  head(3) %&amp;gt;% 
  knitr::kable(&amp;quot;html&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
messages
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat…
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Ok lar… Joking wif u oni…
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&amp;amp;C’s apply 08452810075over18’s
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Modeling text data requires special attention since most machine learning algorithms require numeric inputs, so how can we transform the text entries in &lt;strong&gt;messages&lt;/strong&gt; into numeric type?
The best-known approach is &lt;strong&gt;tokenization&lt;/strong&gt;, which simply means splitting each text in the column &lt;strong&gt;messages&lt;/strong&gt; into small pieces called &lt;strong&gt;tokens&lt;/strong&gt; (together also called a bag of words) in such a way that each token contributes meaningfully to discriminating between the levels of the dependent variable &lt;strong&gt;labels&lt;/strong&gt;. For example, if we think that arbitrary numbers or some symbols like / or dots do not have any discriminating impact, we can remove them from the entries.
Each row in this data (labeled as ham or spam) is considered a &lt;strong&gt;document&lt;/strong&gt; (5574 documents in our case) that holds a text (a collection of tokens), and the whole collection of tokenized documents (as a rectangular matrix) is called a &lt;strong&gt;corpus&lt;/strong&gt;.
To keep things simple, let’s suppose that everything except letters is useless for predicting the labels, so we can use the Spark SQL function &lt;strong&gt;regexp_replace&lt;/strong&gt; to remove everything except letters; then we rename the resulting column &lt;strong&gt;cleaned&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;newdata&amp;lt;-mydata%&amp;gt;%
  mutate(cleaned=regexp_replace(messages,&amp;quot;[^a-zA-Z]&amp;quot;,&amp;quot; &amp;quot;))%&amp;gt;%
  mutate(cleaned=lower(cleaned))%&amp;gt;%
  select(labels,cleaned)
newdata%&amp;gt;%
  select(cleaned)%&amp;gt;%
  head(3)%&amp;gt;%
  knitr::kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;100%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;cleaned&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;ok lar joking wif u oni&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;free entry in a wkly comp to win fa cup final tkts st may text fa to to receive entry question std txt rate t c s apply over s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;At this stage, before going further, we should split the data into a training set and a testing set. However, since the data is imbalanced, with roughly 87% hams and 13% spams, we should preserve the proportions of the labels by splitting the data in such a way that we get stratified samples.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;newdata%&amp;gt;%
  group_by(labels)%&amp;gt;%
  count()%&amp;gt;%
  collect()%&amp;gt;%
  mutate(prop=n/sum(n))%&amp;gt;%
  knitr::kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;labels&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;n&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;prop&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;ham&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4827&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8659849&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;spam&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;747&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1340151&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;To accomplish this task by hand, we first filter the data into ham and spam subsets, then split each subset randomly into a training and a testing part, and finally bind the two training parts together into one training set and do the same for the testing parts.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dataham&amp;lt;-newdata%&amp;gt;%
  filter(labels==&amp;quot;ham&amp;quot;)
dataspam&amp;lt;-newdata%&amp;gt;%
  filter(labels==&amp;quot;spam&amp;quot;)
partitionham&amp;lt;-dataham%&amp;gt;%
  sdf_random_split(training=0.8,test=0.2,seed = 111)
partitionspam&amp;lt;-dataspam%&amp;gt;%
  sdf_random_split(training=0.8,test=0.2,seed = 111)

train&amp;lt;-sdf_bind_rows(partitionham$training,partitionspam$training)%&amp;gt;%
  compute(&amp;quot;train&amp;quot;)
test&amp;lt;-sdf_bind_rows(partitionham$test,partitionspam$test)%&amp;gt;%
  compute(&amp;quot;test&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;tf-model&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; TF model&lt;/h1&gt;
&lt;p&gt;Since machine learning models require numeric inputs, the common practice in text analysis is to convert each text into &lt;strong&gt;tokens&lt;/strong&gt; (pieces) so that these tokens become the features used to discriminate between the class labels; in our case the tokens are single words. With the &lt;strong&gt;TF&lt;/strong&gt; method, if a particular word occurs in a particular document we put its frequency (or just 1 if we do not care about the frequency) in the corresponding cell of the design matrix (called the Document Term Matrix, &lt;strong&gt;DTM&lt;/strong&gt;); otherwise we put zero.
This method gives a very large and sparse rectangular matrix with a huge number of features compared to the number of documents, which is exactly the type of data spark can help handle.
Due to its popularity, we will fit a random forest model, known as one of the most powerful machine learning models, to the transformed data. For brevity we will make use of the spark &lt;strong&gt;pipeline&lt;/strong&gt; feature, which lets us group all the steps required to run the model:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;convert the dependent variable labels to integer type.&lt;/li&gt;
&lt;li&gt;tokenize the cleaned messages into words (tokens).&lt;/li&gt;
&lt;li&gt;remove stop words from the tokens since they tend to spread out randomly among documents.&lt;/li&gt;
&lt;li&gt;replace each term in each document by its frequency number.&lt;/li&gt;
&lt;li&gt;define the model that will be used (here random forest model).&lt;/li&gt;
&lt;/ul&gt;
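&lt;p&gt;As a toy illustration of the tokenization and term-frequency steps above, here is a small sketch in plain R (no spark needed), using two assumed documents:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;docs &amp;lt;- c(&amp;quot;free entry win free&amp;quot;, &amp;quot;ok see you later&amp;quot;)
tokens &amp;lt;- strsplit(docs, &amp;quot; &amp;quot;)          # tokenize on spaces
vocab &amp;lt;- sort(unique(unlist(tokens)))   # vocabulary over the corpus
# document-term matrix: one row per document, one column per token
dtm &amp;lt;- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
dtm&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      entry free later ok see win you
## [1,]     1    2     0  0   0   1   0
## [2,]     0    0     1  1   1   0   1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;strong&gt;ft_count_vectorizer&lt;/strong&gt; step builds essentially this matrix, stored in a sparse format, over the whole corpus.&lt;/p&gt;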
&lt;p&gt;At the final step we use the &lt;strong&gt;ml_random_forest_classifier&lt;/strong&gt; function and keep all the default values, for example 20 trees, a maximum depth of 5, and &lt;strong&gt;gini&lt;/strong&gt; as the impurity function; do not forget to set the seed to make the results reproducible. Lastly, we call the &lt;strong&gt;ml_fit&lt;/strong&gt; function to fit the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pipline&amp;lt;-ml_pipeline(sc)%&amp;gt;%
  ft_string_indexer(input_col = &amp;quot;labels&amp;quot;,output_col=&amp;quot;class&amp;quot;)%&amp;gt;%
  ft_tokenizer(input_col = &amp;quot;cleaned&amp;quot;, output_col=&amp;quot;words&amp;quot;)%&amp;gt;%
  ft_stop_words_remover(input_col = &amp;quot;words&amp;quot;,output_col = &amp;quot;cleaned_words&amp;quot;)%&amp;gt;%
  ft_count_vectorizer(input_col = &amp;quot;cleaned_words&amp;quot;, output_col=&amp;quot;terms&amp;quot;,
                      min_df=5,binary=TRUE)%&amp;gt;%
  ft_vector_assembler(input_cols = &amp;quot;terms&amp;quot;,output_col=&amp;quot;features&amp;quot;)%&amp;gt;%
  ml_random_forest_classifier(label_col=&amp;quot;class&amp;quot;,
                 features_col=&amp;quot;features&amp;quot;,
                 seed=222)
model_rf&amp;lt;-ml_fit(pipline,train)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To evaluate our model we use the &lt;strong&gt;ml_transform&lt;/strong&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ml_transform(model_rf,train)%&amp;gt;%
  ml_binary_classification_evaluator(label_col = &amp;quot;class&amp;quot;,
                                     metric_name= &amp;quot;areaUnderROC&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 0.9693865&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that for binary classification models sparklyr provides only two metrics, &lt;strong&gt;areaUnderROC&lt;/strong&gt; and &lt;strong&gt;areaUnderPR&lt;/strong&gt; (Murphy, 2012). Using the former metric we get a high score of about 0.969.
This metric ranges between 0 and 1; the higher it is, the better the model. However, since this value comes from the training data, it might be the result of overfitting (Lantz, 2016), which is why the more reliable estimate is the one computed on the testing set, which is about 0.965.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ml_transform(model_rf,test)%&amp;gt;%
  ml_binary_classification_evaluator(label_col = &amp;quot;class&amp;quot;,
                                     metric_name= &amp;quot;areaUnderROC&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 0.9653819&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fortunately, the two values are very close to each other, indicating that our model generalizes well.&lt;br /&gt;
To get the predictions we use the &lt;strong&gt;ml_predict&lt;/strong&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-ml_predict(model_rf,test)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, some columns are nested. This is not a problem, since we can extract the elements of such a list column using the function &lt;strong&gt;unlist&lt;/strong&gt;. For instance, we can show the most used words in each class label using the package &lt;strong&gt;wordcloud&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p1&amp;lt;-pred%&amp;gt;%
  filter(labels==&amp;quot;ham&amp;quot;)%&amp;gt;%
  pull(cleaned_words)%&amp;gt;%
  unlist()
wordcloud::wordcloud(p1,max.words = 50, random.order = FALSE,
                     colors=c(&amp;quot;blue&amp;quot;,&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;,&amp;quot;yellow&amp;quot;),random.color = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/sparklyr/text_spark/2020-07-02-predicting-large-text-data-with-spark-via-the-r-package-sparklyr.en_files/figure-html/unnamed-chunk-14-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p2&amp;lt;-pred%&amp;gt;%
  filter(labels==&amp;quot;spam&amp;quot;)%&amp;gt;%
  pull(cleaned_words)%&amp;gt;%
  unlist()
wordcloud::wordcloud(p2,max.words = 50,random.order = FALSE, 
                     colors=c(&amp;quot;blue&amp;quot;,&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;,&amp;quot;yellow&amp;quot;),random.color = TRUE)  &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/sparklyr/text_spark/2020-07-02-predicting-large-text-data-with-spark-via-the-r-package-sparklyr.en_files/figure-html/unnamed-chunk-14-2.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The upper figure shows that the most common words in hams are: get, good, know, whereas the lower figure shows the most common ones for spams: call, free, mobile. This means that a new email message containing the word free, for instance, is more likely to be spam.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;tf-idf-model&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; TF-IDF model&lt;/h1&gt;
&lt;p&gt;The main drawback of the TF method is that it does not take into account the distribution of each term across the documents, which reflects how much information each term provides. To measure the information of a term &lt;strong&gt;t&lt;/strong&gt; we compute its &lt;strong&gt;DF&lt;/strong&gt; (document frequency) value, the number of documents &lt;strong&gt;d&lt;/strong&gt; in which &lt;strong&gt;t&lt;/strong&gt; appears, and from it the inverse document frequency &lt;strong&gt;IDF&lt;/strong&gt; value is computed as follows:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[idf(t,D)=\log\left(\frac{N}{1+|\{d\in D : t\in d\}|}\right)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;where N is the total number of documents (number of rows).
Multiplying TF by IDF gives the TF-IDF value of each term. We add the function &lt;strong&gt;ft_idf&lt;/strong&gt; to the previous TF pipeline, fit the random forest model again on the transformed data, and evaluate the model directly on the test data.&lt;/p&gt;
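&lt;p&gt;As a quick numeric sketch of the formula in plain R (with assumed values: 1000 documents in total, 99 of them containing the term; spark’s own &lt;strong&gt;ft_idf&lt;/strong&gt; implementation may apply a slightly different smoothing):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;N &amp;lt;- 1000   # assumed total number of documents
df &amp;lt;- 99    # assumed number of documents containing the term
log(N/(1 + df))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 2.302585&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A rare term gets a large idf and thus a heavier weight, while a term appearing in almost every document gets an idf close to zero.&lt;/p&gt;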
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pipline2&amp;lt;-ml_pipeline(sc)%&amp;gt;%
  ft_string_indexer(input_col = &amp;quot;labels&amp;quot;,output_col=&amp;quot;class&amp;quot;)%&amp;gt;%
  ft_tokenizer(input_col = &amp;quot;cleaned&amp;quot;, output_col=&amp;quot;words&amp;quot;)%&amp;gt;%
  ft_stop_words_remover(input_col = &amp;quot;words&amp;quot;,output_col = &amp;quot;cleaned_words&amp;quot;)%&amp;gt;%
  ft_count_vectorizer(input_col = &amp;quot;cleaned_words&amp;quot;, output_col=&amp;quot;tf_terms&amp;quot;)%&amp;gt;%
  ft_idf(input_col = &amp;quot;tf_terms&amp;quot;, output_col=&amp;quot;tfidf_terms&amp;quot;)%&amp;gt;%
    ml_random_forest_classifier(label_col=&amp;quot;class&amp;quot;,
                 features_col=&amp;quot;tfidf_terms&amp;quot;,
                 seed=222)

model_rf.tfidf &amp;lt;- ml_fit(pipline2, train)

ml_transform(model_rf.tfidf,test)%&amp;gt;%
  ml_binary_classification_evaluator(label_col = &amp;quot;class&amp;quot;,
                                     metric_name= &amp;quot;areaUnderROC&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.953212&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using this model, which is more complex than the previous one, is not justified for this data, since the two scores are close to each other.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;add-new-features&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Add new features&lt;/h1&gt;
&lt;p&gt;Engineering new features that we believe are more relevant than the existing ones is a popular strategy for improving prediction quality. For example, since we think that spam messages tend to be shorter than ham messages, we can add the messages’ lengths as new features.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train1 &amp;lt;- train %&amp;gt;% mutate(lengths=nchar(cleaned))
test1 &amp;lt;- test %&amp;gt;% mutate(lengths=nchar(cleaned))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s retrain the above models with this newly added feature.&lt;/p&gt;
&lt;div id=&#34;tf-model-1&#34; class=&#34;section level2&#34; number=&#34;7.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;7.1&lt;/span&gt; TF model&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pipline_tf&amp;lt;-ml_pipeline(sc)%&amp;gt;%
  ft_string_indexer(input_col = &amp;quot;labels&amp;quot;,output_col=&amp;quot;class&amp;quot;)%&amp;gt;%
  ft_tokenizer(input_col = &amp;quot;cleaned&amp;quot;, output_col=&amp;quot;words&amp;quot;)%&amp;gt;%
  ft_stop_words_remover(input_col = &amp;quot;words&amp;quot;,output_col = &amp;quot;cleaned_words&amp;quot;)%&amp;gt;%
  ft_count_vectorizer(input_col = &amp;quot;cleaned_words&amp;quot;, output_col=&amp;quot;terms&amp;quot;,
                      min_df=5,binary=TRUE)%&amp;gt;%
  ft_vector_assembler(input_cols = c(&amp;quot;terms&amp;quot;,&amp;quot;lengths&amp;quot;),output_col=&amp;quot;features&amp;quot;)%&amp;gt;%
  ml_random_forest_classifier(label_col=&amp;quot;class&amp;quot;,
                 features_col=&amp;quot;features&amp;quot;,
                 seed=222)

model_rf_new&amp;lt;-ml_fit(pipline_tf,train1)
ml_transform(model_rf_new,test1)%&amp;gt;%
  ml_binary_classification_evaluator(label_col = &amp;quot;class&amp;quot;,
                                     metric_name= &amp;quot;areaUnderROC&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.9849365&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fortunately, our expectation about this new feature is confirmed, as we obtain a significant improvement over the previous results.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;tf_idf-model&#34; class=&#34;section level2&#34; number=&#34;7.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;7.2&lt;/span&gt; tf_idf model&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pipline_tfidf&amp;lt;-ml_pipeline(sc)%&amp;gt;%
  ft_string_indexer(input_col = &amp;quot;labels&amp;quot;,output_col=&amp;quot;class&amp;quot;)%&amp;gt;%
  ft_tokenizer(input_col = &amp;quot;cleaned&amp;quot;, output_col=&amp;quot;words&amp;quot;)%&amp;gt;%
  ft_stop_words_remover(input_col = &amp;quot;words&amp;quot;,output_col = &amp;quot;cleaned_words&amp;quot;)%&amp;gt;%
  ft_count_vectorizer(input_col = &amp;quot;cleaned_words&amp;quot;, output_col=&amp;quot;tf_terms&amp;quot;)%&amp;gt;%
  ft_idf(input_col = &amp;quot;tf_terms&amp;quot;, output_col=&amp;quot;tfidf_terms&amp;quot;)%&amp;gt;%
  ft_vector_assembler(input_cols = c(&amp;quot;tfidf_terms&amp;quot;,&amp;quot;lengths&amp;quot;),output_col=&amp;quot;features&amp;quot;)%&amp;gt;%
    ml_random_forest_classifier(label_col=&amp;quot;class&amp;quot;,
                 features_col=&amp;quot;features&amp;quot;,
                 seed=222)

model_rf_new2 &amp;lt;- ml_fit(pipline_tfidf, train1)

ml_transform(model_rf_new2,test1)%&amp;gt;%
  ml_binary_classification_evaluator(label_col = &amp;quot;class&amp;quot;,
                                     metric_name= &amp;quot;areaUnderROC&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.9857918&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, as noted before, the idf weighting is not justified here, and it would be better to stay with the tf method.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;n-gram-model&#34; class=&#34;section level1&#34; number=&#34;8&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;8&lt;/span&gt; n-gram model&lt;/h1&gt;
&lt;p&gt;In contrast to the function &lt;strong&gt;ft_tokenizer&lt;/strong&gt;, which splits the text into tokens of a single word each, the sparklyr function &lt;strong&gt;ft_ngram&lt;/strong&gt; produces tokens of n words each, preserving the order in which the words appear in the original text.
To understand this better, let’s take the following example.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data &amp;lt;- copy_to(sc, data.frame(x=&amp;quot;I like both R and python&amp;quot;), overwrite = TRUE)
data&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 1]
##   x                       
##   &amp;lt;chr&amp;gt;                   
## 1 I like both R and python&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;strong&gt;ft_tokenizer&lt;/strong&gt; function gives the following tokens:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ft_tokenizer(data, &amp;quot;x&amp;quot;, &amp;quot;y&amp;quot;) %&amp;gt;% 
  mutate(y1=explode(y)) %&amp;gt;% select(y1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 1]
##   y1    
##   &amp;lt;chr&amp;gt; 
## 1 i     
## 2 like  
## 3 both  
## 4 r     
## 5 and   
## 6 python&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With &lt;strong&gt;ft_ngram&lt;/strong&gt; and &lt;span class=&#34;math inline&#34;&gt;\(n=2\)&lt;/span&gt;, we get the following tokens:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data  %&amp;gt;%  ft_tokenizer(&amp;quot;x&amp;quot;, &amp;quot;y&amp;quot;) %&amp;gt;% 
  ft_ngram(&amp;quot;y&amp;quot;, &amp;quot;y1&amp;quot;, n=2) %&amp;gt;%
  mutate(z=explode(y1)) %&amp;gt;% 
  select(z)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 1]
##   z         
##   &amp;lt;chr&amp;gt;     
## 1 i like    
## 2 like both 
## 3 both r    
## 4 r and     
## 5 and python&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s train a 2-gram random forest model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pipline_2gram&amp;lt;-ml_pipeline(sc)%&amp;gt;%
  ft_string_indexer(input_col = &amp;quot;labels&amp;quot;,output_col=&amp;quot;class&amp;quot;)%&amp;gt;%
  ft_tokenizer(input_col = &amp;quot;cleaned&amp;quot;, output_col=&amp;quot;words&amp;quot;)%&amp;gt;%
  ft_stop_words_remover(input_col = &amp;quot;words&amp;quot;,output_col = &amp;quot;cleaned_words&amp;quot;)%&amp;gt;%
  ft_ngram(input_col = &amp;quot;cleaned_words&amp;quot;, output_col=&amp;quot;ngram_words&amp;quot;, n=2) %&amp;gt;% 
  ft_count_vectorizer(input_col = &amp;quot;ngram_words&amp;quot;, output_col=&amp;quot;tf_terms&amp;quot;)%&amp;gt;%
  ft_vector_assembler(input_cols = c(&amp;quot;tf_terms&amp;quot;,&amp;quot;lengths&amp;quot;),output_col=&amp;quot;features&amp;quot;)%&amp;gt;%
  ml_random_forest_classifier(label_col=&amp;quot;class&amp;quot;,
                 features_col=&amp;quot;features&amp;quot;,
                 seed=222)

model_rf_2gram &amp;lt;- ml_fit(pipline_2gram, train1)

ml_transform(model_rf_2gram,test1)%&amp;gt;%
  ml_binary_classification_evaluator(label_col = &amp;quot;class&amp;quot;,
                                     metric_name= &amp;quot;areaUnderROC&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.8835537&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that this function produces only tokens with exactly two words, not tokens with at most two words, so the single-word terms are discarded. That is why we obtained a lower score than the previous models.&lt;/p&gt;
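&lt;p&gt;If we wanted to keep both the single words and the word pairs, one possible variation of the above pipeline (a sketch, which we have not run here) is to vectorize the unigrams and the bigrams separately and assemble both term-frequency columns into the feature vector:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pipline_12gram &amp;lt;- ml_pipeline(sc) %&amp;gt;%
  ft_string_indexer(input_col = &amp;quot;labels&amp;quot;, output_col = &amp;quot;class&amp;quot;) %&amp;gt;%
  ft_tokenizer(input_col = &amp;quot;cleaned&amp;quot;, output_col = &amp;quot;words&amp;quot;) %&amp;gt;%
  ft_stop_words_remover(input_col = &amp;quot;words&amp;quot;, output_col = &amp;quot;cleaned_words&amp;quot;) %&amp;gt;%
  ft_ngram(input_col = &amp;quot;cleaned_words&amp;quot;, output_col = &amp;quot;ngram_words&amp;quot;, n = 2) %&amp;gt;%
  # one count vectorizer for the unigrams, another for the bigrams
  ft_count_vectorizer(input_col = &amp;quot;cleaned_words&amp;quot;, output_col = &amp;quot;tf_words&amp;quot;) %&amp;gt;%
  ft_count_vectorizer(input_col = &amp;quot;ngram_words&amp;quot;, output_col = &amp;quot;tf_ngrams&amp;quot;) %&amp;gt;%
  ft_vector_assembler(input_cols = c(&amp;quot;tf_words&amp;quot;, &amp;quot;tf_ngrams&amp;quot;, &amp;quot;lengths&amp;quot;),
                      output_col = &amp;quot;features&amp;quot;) %&amp;gt;%
  ml_random_forest_classifier(label_col = &amp;quot;class&amp;quot;,
                              features_col = &amp;quot;features&amp;quot;,
                              seed = 222)&lt;/code&gt;&lt;/pre&gt;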
&lt;p&gt;When you are satisfied with your final model, you can save it for further use as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#ml_save(model_rf_2gram, &amp;quot;spark_ngram&amp;quot;, overwrite = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One last thing: when you finish your work, do not forget to free your resources by disconnecting from Spark as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;spark_disconnect(sc)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;9&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;9&lt;/span&gt; Conclusion:&lt;/h1&gt;
&lt;p&gt;This article is a brief introduction illustrating how easy it is to handle and model large datasets by combining the two powerful tools R and Spark. We used a text dataset because this type of data characterizes many of the large datasets encountered in the real world.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34; number=&#34;10&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;10&lt;/span&gt; References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Brett Lantz (2016). Machine Learning with R. Packt Publishing. Second edition. ISBN 978-1-78439-390-8.&lt;/li&gt;
&lt;li&gt;Gareth James et al. (2013). An Introduction to Statistical Learning. Springer. ISBN 978-1-4614-7138-7.&lt;/li&gt;
&lt;li&gt;Javier Luraschi (2019). Mastering Spark with R. O’Reilly. &lt;a href=&#34;https://therinspark.com/intro.html&#34; class=&#34;uri&#34;&gt;https://therinspark.com/intro.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Kevin P. Murphy (2012). Machine Learning: A Probabilistic Perspective. The MIT Press. ISBN 978-0-262-01802-9.&lt;/li&gt;
&lt;li&gt;Spärck Jones, K. (1972). A Statistical Interpretation of Term Specificity and Its Application in Retrieval. Journal of Documentation, 28: 11–21.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.kaggle.com/team-ai/spam-text-message-classification&#34; class=&#34;uri&#34;&gt;https://www.kaggle.com/team-ai/spam-text-message-classification&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.tidyverse.org/packages/&#34; class=&#34;uri&#34;&gt;https://www.tidyverse.org/packages/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf&#34; class=&#34;uri&#34;&gt;https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;session-information&#34; class=&#34;section level1&#34; number=&#34;11&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;11&lt;/span&gt; Session information&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## R version 4.0.1 (2020-06-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] forcats_0.5.0   stringr_1.4.0   dplyr_1.0.2     purrr_0.3.4    
##  [5] readr_1.3.1     tidyr_1.1.2     tibble_3.0.3    ggplot2_3.3.2  
##  [9] tidyverse_1.3.0 sparklyr_1.4.0 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5         lubridate_1.7.9    forge_0.2.0        utf8_1.1.4        
##  [5] assertthat_0.2.1   rprojroot_1.3-2    digest_0.6.25      slam_0.1-47       
##  [9] R6_2.4.1           cellranger_1.1.0   backports_1.1.10   reprex_0.3.0      
## [13] evaluate_0.14      httr_1.4.2         highr_0.8          blogdown_0.20     
## [17] pillar_1.4.6       rlang_0.4.7        readxl_1.3.1       uuid_0.1-4        
## [21] rstudioapi_0.11    blob_1.2.1         rmarkdown_2.4      config_0.3        
## [25] r2d3_0.2.3         htmlwidgets_1.5.2  munsell_0.5.0      broom_0.7.1       
## [29] compiler_4.0.1     modelr_0.1.8       xfun_0.18          pkgconfig_2.0.3   
## [33] askpass_1.1        base64enc_0.1-3    htmltools_0.5.0    openssl_1.4.3     
## [37] tidyselect_1.1.0   bookdown_0.20      fansi_0.4.1        crayon_1.3.4      
## [41] dbplyr_1.4.4       withr_2.3.0        grid_4.0.1         jsonlite_1.7.1    
## [45] gtable_0.3.0       lifecycle_0.2.0    DBI_1.1.0          magrittr_1.5      
## [49] scales_1.1.1       cli_2.0.2          stringi_1.5.3      fs_1.5.0          
## [53] NLP_0.2-0          xml2_1.3.2         ellipsis_0.3.1     generics_0.0.2    
## [57] vctrs_0.3.4        wordcloud_2.6      RColorBrewer_1.1-2 tools_4.0.1       
## [61] glue_1.4.2         hms_0.5.3          parallel_4.0.1     yaml_2.2.1        
## [65] tm_0.7-7           colorspace_1.4-1   rvest_0.3.6        knitr_1.30        
## [69] haven_2.3.1&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Ordinal data models</title>
      <link>https://modelingwithr.rbind.io/post/ordinal/ordinal-data-models/</link>
      <pubDate>Tue, 09 Jun 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/post/ordinal/ordinal-data-models/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#ordered-logistic-regression-model-logit&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Ordered logistic regression model (logit)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#ordinal-logistic-rgeression-model-probit&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Ordinal logistic regression model (probit)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#cart-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; CART model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#ordinal-random-forst-model.&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Ordinal random forest model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#continuation-ratio-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Continuation Ratio Model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#compare-models&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8&lt;/span&gt; Compare models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;9&lt;/span&gt; Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-information&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;10&lt;/span&gt; Session information&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;This tutorial aims to explore the most popular models used to predict an ordered response variable. We will use the &lt;strong&gt;heart disease&lt;/strong&gt; data &lt;a href=&#34;https://www.kaggle.com/johnsmith88/heart-disease-dataset&#34;&gt;uploaded from the kaggle website&lt;/a&gt;, where our response will be the chest pain variable &lt;strong&gt;cp&lt;/strong&gt; instead of the usually used &lt;strong&gt;target&lt;/strong&gt; variable.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;First, we load the data and the libraries that we will need throughout this illustration.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;options(warn = -1)
library(tidyverse)
library(caret)
library(tidymodels)
mydata&amp;lt;-read.csv(&amp;quot;../heart.csv&amp;quot;,header = TRUE)
names(mydata)[1]&amp;lt;-&amp;quot;age&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The data at hand has the following features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;age.&lt;/li&gt;
&lt;li&gt;sex: 1=male,0=female&lt;/li&gt;
&lt;li&gt;cp : chest pain type.&lt;/li&gt;
&lt;li&gt;trestbps : resting blood pressure.&lt;/li&gt;
&lt;li&gt;chol: serum cholestoral.&lt;/li&gt;
&lt;li&gt;fbs : fasting blood sugar.&lt;/li&gt;
&lt;li&gt;restecg : resting electrocardiographic results.&lt;/li&gt;
&lt;li&gt;thalach : maximum heart rate achieved&lt;/li&gt;
&lt;li&gt;exang : exercise induced angina.&lt;/li&gt;
&lt;li&gt;oldpeak : ST depression induced by exercise relative to rest.&lt;/li&gt;
&lt;li&gt;slope : the slope of the peak exercise ST segment.&lt;/li&gt;
&lt;li&gt;ca : number of major vessels colored by flourosopy.&lt;/li&gt;
&lt;li&gt;thal : it is not well defined from the data source.&lt;/li&gt;
&lt;li&gt;target: have heart disease or not.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A good starting point for exploring the summary of all predictors and the missing values is the powerful &lt;strong&gt;skim&lt;/strong&gt; function from the &lt;strong&gt;skimr&lt;/strong&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;skimr::skim(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;caption&gt;&lt;span id=&#34;tab:unnamed-chunk-3&#34;&gt;Table 2.1: &lt;/span&gt;Data summary&lt;/caption&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Name&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;mydata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Number of rows&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;303&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Number of columns&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;_______________________&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Column type frequency:&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;numeric&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;________________________&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Group variables&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Variable type: numeric&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;skim_variable&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;n_missing&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;complete_rate&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mean&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;sd&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p0&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p25&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p50&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p75&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p100&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;hist&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;age&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;54.37&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.08&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;55.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;61.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;77.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▁▆▇▇▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;sex&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.68&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.47&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▃▁▁▁▇&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;cp&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.97&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▇▃▁▅▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;trestbps&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;131.62&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.54&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;94&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;120.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;130.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;140.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;200.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▃▇▅▁▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;chol&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;246.26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;51.83&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;126&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;211.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;240.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;274.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;564.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▃▇▂▁▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;fbs&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.15&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.36&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▇▁▁▁▂&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;restecg&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.53&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.53&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▇▁▇▁▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;thalach&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;149.65&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;22.91&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;71&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;133.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;153.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;166.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;202.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▁▂▅▇▂&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;exang&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.33&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.47&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▇▁▁▁▃&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;oldpeak&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.16&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▇▂▁▁▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;slope&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.40&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.62&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▁▁▇▁▇&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;ca&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.73&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▇▃▂▁▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;thal&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.31&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.61&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▁▁▁▇▆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;target&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.54&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.50&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▇▁▁▁▇&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For our case we will use the chest pain type &lt;strong&gt;cp&lt;/strong&gt; variable as our target variable since it is a categorical variable. However, for pedagogic purposes, we will transform it into an ordered factor with only three levels, &lt;strong&gt;no pain&lt;/strong&gt;, &lt;strong&gt;moderate pain&lt;/strong&gt;, and &lt;strong&gt;severe pain&lt;/strong&gt; (instead of the current four).&lt;/p&gt;
&lt;p&gt;Looking at the above output, we convert the variables that should be of factor type, namely &lt;strong&gt;sex&lt;/strong&gt;, &lt;strong&gt;target&lt;/strong&gt;, &lt;strong&gt;fbs&lt;/strong&gt;, &lt;strong&gt;restecg&lt;/strong&gt;, &lt;strong&gt;exang&lt;/strong&gt;, &lt;strong&gt;slope&lt;/strong&gt;, &lt;strong&gt;ca&lt;/strong&gt;, and &lt;strong&gt;thal&lt;/strong&gt;. For the response variable &lt;strong&gt;cp&lt;/strong&gt;, we drop its least frequent level with all its related rows, then we rename the remaining levels: &lt;strong&gt;no&lt;/strong&gt; pain for the most frequent one, &lt;strong&gt;severe&lt;/strong&gt; pain for the least frequent one, and &lt;strong&gt;moderate&lt;/strong&gt; pain for the last one.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(mydata$cp)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
  0   1   2   3 
143  50  87  23 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see that level &lt;strong&gt;3&lt;/strong&gt; is the least frequent one.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-mydata %&amp;gt;%
  modify_at(c(&amp;quot;cp&amp;quot;, &amp;quot;sex&amp;quot;, &amp;quot;target&amp;quot;, &amp;quot;fbs&amp;quot;, &amp;quot;restecg&amp;quot;, &amp;quot;exang&amp;quot;, &amp;quot;slope&amp;quot;, &amp;quot;ca&amp;quot;, &amp;quot;thal&amp;quot;),
            as.factor)
mydata&amp;lt;-mydata[mydata$cp!=3,]
mydata$cp&amp;lt;-fct_drop(mydata$cp,only=levels(mydata$cp))
table(mydata$cp)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
  0   1   2 
143  50  87 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Based on these frequencies, we rename and order the levels as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata$cp&amp;lt;-fct_recode(mydata$cp,no=&amp;quot;0&amp;quot;,sev=&amp;quot;1&amp;quot;,mod=&amp;quot;2&amp;quot;)
mydata$cp&amp;lt;-factor(mydata$cp,ordered = TRUE)
mydata$cp&amp;lt;-fct_infreq(mydata$cp)
mydata$cp[1:5]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] mod sev sev no  no 
Levels: no &amp;lt; mod &amp;lt; sev&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As with logistic regression, the number of cases in each cell of the cross table between the outcome and each factor should exceed the threshold of 5 commonly applied in practice.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~cp+sex,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;     sex
cp      0   1
  no   39 104
  mod  35  52
  sev  18  32&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~cp+target,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;     target
cp      0   1
  no  104  39
  mod  18  69
  sev   9  41&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~cp+fbs,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;     fbs
cp      0   1
  no  125  18
  mod  70  17
  sev  45   5&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~cp+restecg,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;     restecg
cp     0  1  2
  no  78 62  3
  mod 36 50  1
  sev 19 31  0&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~cp+exang,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;     exang
cp     0  1
  no  63 80
  mod 76 11
  sev 46  4&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~cp+slope,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;     slope
cp     0  1  2
  no  11 84 48
  mod  5 33 49
  sev  2 12 36&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~cp+ca,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;     ca
cp     0  1  2  3  4
  no  65 34 29 14  1
  mod 57 20  2  5  3
  sev 37  8  3  1  1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~cp+thal,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;     thal
cp     0  1  2  3
  no   1 12 52 78
  mod  1  2 62 22
  sev  0  2 39  9&lt;/code&gt;&lt;/pre&gt;
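&lt;p&gt;As a side note, the eight cross tables above can be generated in a single pass rather than call by call; a minimal sketch (this loop is our own convenience, not part of any package):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;factors &amp;lt;- c(&amp;quot;sex&amp;quot;, &amp;quot;target&amp;quot;, &amp;quot;fbs&amp;quot;, &amp;quot;restecg&amp;quot;,
             &amp;quot;exang&amp;quot;, &amp;quot;slope&amp;quot;, &amp;quot;ca&amp;quot;, &amp;quot;thal&amp;quot;)
# build the formula ~cp + v for each factor and cross-tabulate against cp
lapply(setNames(factors, factors),
       function(v) xtabs(reformulate(c(&amp;quot;cp&amp;quot;, v)), data = mydata))&lt;/code&gt;&lt;/pre&gt;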
&lt;p&gt;The following variables do not meet this threshold and will therefore be removed from the predictor set: &lt;strong&gt;restecg&lt;/strong&gt;, &lt;strong&gt;exang&lt;/strong&gt;, &lt;strong&gt;slope&lt;/strong&gt;, &lt;strong&gt;ca&lt;/strong&gt;, and &lt;strong&gt;thal&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-mydata[,setdiff(names(mydata), 
                        c(&amp;quot;restecg&amp;quot;, &amp;quot;exang&amp;quot;, &amp;quot;slope&amp;quot;, &amp;quot;ca&amp;quot;,  &amp;quot;thal&amp;quot;))]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The data is now ready, so we can split it into training and testing sets.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1122)
parts &amp;lt;- initial_split(mydata, prop=0.8, strata = cp)
train &amp;lt;- training(parts)
test &amp;lt;- testing(parts)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The models that we will use are: the ordinal logistic model, the CART model, the ordinal random forest model, and the continuation ratio model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ordered-logistic-regression-model-logit&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Ordered logistic regression model (logit)&lt;/h1&gt;
&lt;p&gt;Before training this type of model let’s show how it works. For simplicity suppose we have data that has an ordered outcome &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; with three class labels (“1”,“2”,“3”) and only two features &lt;span class=&#34;math inline&#34;&gt;\(x_1\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(x_2\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;First we define a latent variable as a linear combination of the features:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\begin{equation}
y_i^*=\beta_1 X_{i1}+\beta_2 X_{i2}
\end{equation}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Then since we have three classes we define two thresholds for this latent variable &lt;span class=&#34;math inline&#34;&gt;\(\alpha_1\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\alpha_2\)&lt;/span&gt; such that a particular observation &lt;span class=&#34;math inline&#34;&gt;\(y_i\)&lt;/span&gt; will be classified as follows:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\begin{cases} y_i=1 &amp;amp; \text{if $y_i^* \leq \alpha_1$} \\
                y_i=2 &amp;amp; \text{if $\alpha_1 &amp;lt; y_i^* \leq \alpha_2$} \\
                y_i=3 &amp;amp; \text{if $y_i^* &amp;gt; \alpha_2$}
\end{cases}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Now we can obtain the probability that a particular observation falls into a specific class as follows:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\begin{cases} p(y_i=1)=p(y_i^* \leq \alpha_1)=F(\alpha_1-\beta_1 X_{i1}-\beta_2 X_{i2}) \\
                p(y_i=2)=p(\alpha_1 &amp;lt; y_i^* \leq \alpha_2)=F(\alpha_2-\beta_1 X_{i1}-\beta_2 X_{i2})-F(\alpha_1-\beta_1 X_{i1}-\beta_2 X_{i2}) \\
                p(y_i=3)=1-p(y_i=2)-p(y_i=1)\end{cases}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;It remains to choose a suitable distribution function F. Two are commonly used for this type of data: the &lt;strong&gt;logit&lt;/strong&gt; function &lt;span class=&#34;math inline&#34;&gt;\(F(x)=\frac{1}{1+e^{-x}}\)&lt;/span&gt; and the normal distribution function, known as &lt;strong&gt;probit&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: there exist other functions like &lt;strong&gt;loglog&lt;/strong&gt;, &lt;strong&gt;cloglog&lt;/strong&gt;, and &lt;strong&gt;cauchit&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Using the &lt;strong&gt;logit&lt;/strong&gt; function, the probabilities become:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\begin{cases} p(y_i=1)=\frac{1}{1+e^{-(\alpha_1-\beta_1 X_{i1}-\beta_2 X_{i2})}} \\
                p(y_i=2)=\frac{1}{1+e^{-(\alpha_2-\beta_1 X_{i1}-\beta_2 X_{i2})}}-p(y_i=1) \\
                p(y_i=3)=1-p(y_i=2)-p(y_i=1)\end{cases}\]&lt;/span&gt;&lt;/p&gt;
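&lt;p&gt;As a quick sanity check, these formulas can be evaluated directly in base R, where &lt;code&gt;plogis&lt;/code&gt; plays the role of the logistic function F (the threshold and coefficient values below are made up purely for illustration):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# hypothetical values, for illustration only
alpha &amp;lt;- c(4.6, 6.5)           # thresholds alpha_1, alpha_2
eta   &amp;lt;- 1.9 * 1 + 0.02 * 178  # beta_1*X_i1 + beta_2*X_i2
p1 &amp;lt;- plogis(alpha[1] - eta)
p2 &amp;lt;- plogis(alpha[2] - eta) - p1
p3 &amp;lt;- 1 - p1 - p2
c(p1, p2, p3)  # the three class probabilities sum to 1&lt;/code&gt;&lt;/pre&gt;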
&lt;p&gt;The &lt;strong&gt;MASS&lt;/strong&gt; package provides the &lt;strong&gt;polr&lt;/strong&gt; function to fit an ordinal logistic regression, which we call here through caret’s &lt;strong&gt;train&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(MASS)
set.seed(1234)
model_logistic&amp;lt;-train(cp~., data=train,
                      method=&amp;quot;polr&amp;quot;,
                      tuneGrid=expand.grid(method=&amp;quot;logistic&amp;quot;))

summary(model_logistic)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
Coefficients:
              Value Std. Error  t value
age       0.0112236   0.018219  0.61605
sex1      0.2593720   0.316333  0.81993
trestbps -0.0002329   0.009090 -0.02562
chol     -0.0013238   0.002697 -0.49082
fbs1      0.3188826   0.401836  0.79356
thalach   0.0226246   0.008199  2.75933
oldpeak  -0.3360326   0.163547 -2.05465
target1   1.7234740   0.376279  4.58031

Intercepts:
        Value   Std. Error t value
no|mod   4.5786  1.9271     2.3759
mod|sev  6.5004  1.9527     3.3289

Residual Deviance: 376.4697 
AIC: 396.4697 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This table does not provide p-values, but that is not a big problem since we can compute and append them with the following script.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prob &amp;lt;- pnorm(abs(summary(model_logistic)$coefficients[,3]),lower.tail = FALSE)*2
cbind(summary(model_logistic)$coefficients,prob)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;                 Value  Std. Error     t value         prob
age       0.0112236479 0.018218848  0.61604597 5.378642e-01
sex1      0.2593719567 0.316332564  0.81993442 4.122535e-01
trestbps -0.0002329023 0.009090066 -0.02562163 9.795591e-01
chol     -0.0013237835 0.002697079 -0.49082122 6.235529e-01
fbs1      0.3188825831 0.401836034  0.79356393 4.274493e-01
thalach   0.0226246089 0.008199317  2.75932853 5.792027e-03
oldpeak  -0.3360326371 0.163547467 -2.05464899 3.991292e-02
target1   1.7234739863 0.376278770  4.58031152 4.642839e-06
no|mod    4.5785821473 1.927119568  2.37586822 1.750771e-02
mod|sev   6.5003986218 1.952726089  3.32888399 8.719471e-04&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using a p-value threshold of 0.05, we remove the non-significant variables &lt;strong&gt;age&lt;/strong&gt;, &lt;strong&gt;trestbps&lt;/strong&gt;, and &lt;strong&gt;chol&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
model_logistic&amp;lt;-train(cp~.-age-trestbps-chol, data=train,
                      method=&amp;quot;polr&amp;quot;,tuneGrid=expand.grid(method=&amp;quot;logistic&amp;quot;))
prob &amp;lt;- pnorm(abs(summary(model_logistic)$coefficients[,3]),lower.tail = FALSE)*2
cbind(summary(model_logistic)$coefficients,prob)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;              Value  Std. Error    t value         prob
sex1     0.25427581 0.308143065  0.8251875 4.092651e-01
fbs1     0.37177505 0.384667871  0.9664832 3.338024e-01
thalach  0.02050951 0.007487511  2.7391620 6.159602e-03
oldpeak -0.33669473 0.161699555 -2.0822242 3.732199e-02
target1  1.71338020 0.369558584  4.6362885 3.547208e-06
no|mod   4.00836398 1.143111953  3.5065367 4.539789e-04
mod|sev  5.92987585 1.185074388  5.0038005 5.621092e-07&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that we keep the factors &lt;strong&gt;sex&lt;/strong&gt; and &lt;strong&gt;fbs&lt;/strong&gt; even though they are not significant, because the intercepts are significant.&lt;/p&gt;
&lt;p&gt;To better understand these coefficients, let’s restrict the model to only two predictors.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
model1&amp;lt;-train(cp~target+thalach, 
              data=train,
              method = &amp;quot;polr&amp;quot;,
              tuneGrid=expand.grid(method=&amp;quot;logistic&amp;quot;))
summary(model1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
Coefficients:
          Value Std. Error t value
target1 1.87953   0.333153   5.642
thalach 0.02347   0.007372   3.184

Intercepts:
        Value  Std. Error t value
no|mod  4.6457 1.0799     4.3018 
mod|sev 6.5325 1.1271     5.7959 

Residual Deviance: 383.3144 
AIC: 391.3144 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Plugging these coefficients into the above equations, we obtain the probability of each class as follows:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\begin{cases} p(no)=\frac{1}{1+e^{-(4.6457-1.87953X_{i1}-0.02347X_{i2})}} \\
                p(mod)=\frac{1}{1+e^{-(6.5325-1.87953X_{i1}-0.02347X_{i2})}}-p(no) \\
                p(sev)=1-p(mod)-p(no)\end{cases}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Let’s now predict a particular patient, say the third one.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train[3,c(&amp;quot;cp&amp;quot;,&amp;quot;thalach&amp;quot;,&amp;quot;target&amp;quot;)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;   cp thalach target
4 sev     178      1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We plug in the predictor values as follows:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\begin{cases} p(no)=\frac{1}{1+e^{-(4.6457-1.87953*1-0.02347*178)}} \\
                p(mod)=\frac{1}{1+e^{-(6.5325-1.87953*1-0.02347*178)}}-p(no) \\
                p(sev)=1-p(mod)-p(no)\end{cases}=\begin{cases} p(no)=0.1959992 \\
                p(mod)=0.6166398-0.1959992=0.4206406 \\
                p(sev)=1-0.4206406-0.1959992=0.3833602\end{cases}\]&lt;/span&gt;&lt;/p&gt;
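&lt;p&gt;These hand computations can be reproduced in one line each with base R, where &lt;code&gt;plogis&lt;/code&gt; is the logistic distribution function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p_no  &amp;lt;- plogis(4.6457 - 1.87953 * 1 - 0.02347 * 178)
p_mod &amp;lt;- plogis(6.5325 - 1.87953 * 1 - 0.02347 * 178) - p_no
p_sev &amp;lt;- 1 - p_no - p_mod
c(no = p_no, mod = p_mod, sev = p_sev)  # matches the probabilities computed above&lt;/code&gt;&lt;/pre&gt;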
&lt;p&gt;Taking the highest probability, this patient is predicted to have &lt;strong&gt;mod&lt;/strong&gt; pain.
Now let’s compare these probabilities with those returned by the &lt;strong&gt;predict&lt;/strong&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model1, train[1:3,], type = &amp;quot;prob&amp;quot;) %&amp;gt;% tail(1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;         no       mod      sev
4 0.1958709 0.4205981 0.383531&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we go back to our original model and compute the accuracy rate for the training data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model_logistic, train) %&amp;gt;% 
  bind_cols(train) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.611&lt;/code&gt;&lt;/pre&gt;
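&lt;p&gt;The same accuracy can be checked without &lt;strong&gt;yardstick&lt;/strong&gt;, using a plain cross-tabulation of predictions against the truth:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(model_logistic, train)
conf &amp;lt;- table(pred = pred, truth = train$cp)  # confusion matrix
sum(diag(conf)) / sum(conf)                    # overall accuracy&lt;/code&gt;&lt;/pre&gt;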
&lt;p&gt;With the logistic regression model we get 61% accuracy on the training set, which is quite bad. So let’s evaluate the model on the testing set now.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model_logistic, test) %&amp;gt;% 
  bind_cols(test) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.648&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Surprisingly, the accuracy rate for the testing set is about 65%, which is larger than that computed from the training data (61%). This is an indication of an underfitting problem (the opposite of overfitting). Is there any way to improve the model performance? Maybe, by going back and tuning some hyperparameters. This model, however, offers few hyperparameters to tune apart from the link function, which is by default the &lt;strong&gt;logistic&lt;/strong&gt; function; other choices exist, such as &lt;strong&gt;probit&lt;/strong&gt;, &lt;strong&gt;loglog&lt;/strong&gt;, etc.&lt;/p&gt;
&lt;p&gt;For our case, let’s try this model with the probit function.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ordinal-logistic-rgeression-model-probit&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Ordinal logistic regression model (probit)&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
model_probit&amp;lt;-train(cp~.-age-trestbps-chol, data=train,
                    method=&amp;quot;polr&amp;quot;,
                    tuneGrid=expand.grid(method=&amp;quot;probit&amp;quot;))

predict(model_probit, train) %&amp;gt;% 
  bind_cols(train) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.606&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This rate is slightly worse than that of the previous model. But what about the testing set?&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model_probit, test) %&amp;gt;% 
  bind_cols(test) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.593&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This one is also worse than in the previous model, which means that for this data the logistic link performs better than the probit one.&lt;/p&gt;
&lt;p&gt;When many attempts to improve a model’s performance do not gain much, it is better to try different types of models.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;cart-model&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; CART model&lt;/h1&gt;
&lt;p&gt;This is a tree-based model used for both classification and regression. To train it we make use of the &lt;strong&gt;rpartScore&lt;/strong&gt; package, and for simplicity we include only the significant predictors from the previous model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rpartScore)
set.seed(1234)
model_cart&amp;lt;-train(cp~.-age-trestbps-chol, data=train,
                      method=&amp;quot;rpartScore&amp;quot;)
model_cart&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;CART or Ordinal Responses 

226 samples
  8 predictor
  3 classes: &amp;#39;no&amp;#39;, &amp;#39;mod&amp;#39;, &amp;#39;sev&amp;#39; 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 226, 226, 226, 226, 226, 226, ... 
Resampling results across tuning parameters:

  cp          split  prune  Accuracy   Kappa    
  0.02702703  abs    mr     0.5748197  0.2845545
  0.02702703  abs    mc     0.5796085  0.3011122
  0.02702703  quad   mr     0.5711605  0.2764466
  0.02702703  quad   mc     0.5805216  0.3020125
  0.04504505  abs    mr     0.5620975  0.2719646
  0.04504505  abs    mc     0.5966801  0.3274893
  0.04504505  quad   mr     0.5592845  0.2608402
  0.04504505  quad   mc     0.5930817  0.3208220
  0.21621622  abs    mr     0.5303342  0.1266324
  0.21621622  abs    mc     0.6004116  0.3343997
  0.21621622  quad   mr     0.5290009  0.1143360
  0.21621622  quad   mc     0.5928132  0.3225686

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were cp = 0.2162162, split = abs and
 prune = mc.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The caret model uses the bootstrapping technique for hyperparameter tuning. In our case, the largest accuracy rate is about 60.04%, with the complexity parameter &lt;code&gt;cp = 0.2162162&lt;/code&gt;, &lt;code&gt;split = abs&lt;/code&gt;, and &lt;code&gt;prune = mc&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The argument &lt;strong&gt;split&lt;/strong&gt; controls the splitting function used to grow the tree by setting the misclassification costs in the generalized &lt;strong&gt;Gini&lt;/strong&gt; impurity function to the absolute &lt;strong&gt;abs&lt;/strong&gt; or squared &lt;strong&gt;quad&lt;/strong&gt;.
The argument &lt;strong&gt;prune&lt;/strong&gt; is used to select the performance measure to prune the tree between total misclassification rate &lt;strong&gt;mr&lt;/strong&gt; or misclassification cost &lt;strong&gt;mc&lt;/strong&gt;.&lt;/p&gt;
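&lt;p&gt;If we wanted to fit the model with the selected values directly instead of tuning, caret accepts them through an explicit &lt;code&gt;tuneGrid&lt;/code&gt; (a sketch using the values chosen above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
model_cart_fixed &amp;lt;- train(cp ~ . - age - trestbps - chol, data = train,
                          method = &amp;quot;rpartScore&amp;quot;,
                          tuneGrid = expand.grid(cp = 0.2162162,
                                                 split = &amp;quot;abs&amp;quot;,
                                                 prune = &amp;quot;mc&amp;quot;))&lt;/code&gt;&lt;/pre&gt;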
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model_cart, train) %&amp;gt;% 
  bind_cols(train) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.615&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Surprisingly, we get approximately the same accuracy rate as the logit model. Let’s check the testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model_cart, test) %&amp;gt;% 
  bind_cols(test) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.630&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this model, however, we get a lower accuracy rate on the testing set than that of the logistic model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ordinal-random-forst-model.&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Ordinal Random forest model&lt;/h1&gt;
&lt;p&gt;This model is a corrected version of the random forest model that takes into account the ordinal nature of the response variable. For more detail about this model, read this great &lt;a href=&#34;https://pdfs.semanticscholar.org/5bb3/5b76774bf0d582eda4ec06e2cb3ce021772c.pdf&#34;&gt;paper&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To train an ordinal random forest model, we need to call the following packages:
&lt;strong&gt;e1071&lt;/strong&gt;, &lt;strong&gt;ranger&lt;/strong&gt;, and &lt;strong&gt;ordinalForest&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ordinalForest)
library(ranger)
library(e1071)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the caret function &lt;strong&gt;train&lt;/strong&gt; uses bootstrapping to tune the hyperparameters, the training process is very slow; that is why we save the resulting model and load it again.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# set.seed(1234)
# model_forest&amp;lt;-train(cp~.-age-trestbps-chol, data=train,
#                       method=&amp;#39;ordinalRF&amp;#39;)

# saveRDS(model_forest, #&amp;quot;C://Users/dell/Documents/new-blog/content/post/ordinal/model_forest.rds&amp;quot;)

model_forest &amp;lt;- readRDS(&amp;quot;C://Users/dell/Documents/new-blog/content/post/ordinal/model_forest.rds&amp;quot;)

model_forest&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Random Forest 

226 samples
  8 predictor
  3 classes: &amp;#39;no&amp;#39;, &amp;#39;mod&amp;#39;, &amp;#39;sev&amp;#39; 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 226, 226, 226, 226, 226, 226, ... 
Resampling results across tuning parameters:

  nsets  ntreeperdiv  ntreefinal  Accuracy   Kappa    
   50     50          200         0.5808002  0.3008422
   50     50          400         0.5776249  0.2954635
   50     50          600         0.5802381  0.3009845
   50    100          200         0.5805333  0.2982787
   50    100          400         0.5835550  0.3046105
   50    100          600         0.5792347  0.2966789
   50    150          200         0.5781306  0.2957198
   50    150          400         0.5763106  0.2929363
   50    150          600         0.5773418  0.2939428
  100     50          200         0.5825633  0.3037443
  100     50          400         0.5766958  0.2946094
  100     50          600         0.5801625  0.2992074
  100    100          200         0.5817261  0.3017512
  100    100          400         0.5802315  0.2984311
  100    100          600         0.5760195  0.2936909
  100    150          200         0.5791770  0.2986367
  100    150          400         0.5773527  0.2940674
  100    150          600         0.5800019  0.2990121
  150     50          200         0.5738722  0.2890697
  150     50          400         0.5755389  0.2915668
  150     50          600         0.5793087  0.2994984
  150    100          200         0.5821339  0.3039247
  150    100          400         0.5810183  0.3003594
  150    100          600         0.5797573  0.3001752
  150    150          200         0.5792505  0.2992324
  150    150          400         0.5757645  0.2930867
  150    150          600         0.5802099  0.2993488

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nsets = 50, ntreeperdiv = 100
 and ntreefinal = 400.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can plot the important predictors as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(varImp(model_forest))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/ordinal/2020-06-09-ordinal-data-models.en_files/figure-html/unnamed-chunk-25-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now we can obtain the accuracy rate for the training set as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model_forest, train) %&amp;gt;% 
  bind_cols(train) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.819&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Great! With this model, the accuracy rate has improved substantially, to roughly 82%. But wait: what matters is the accuracy on the testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model_forest, test) %&amp;gt;% 
  bind_cols(test) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.574&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is exactly what is called the overfitting problem: the model generalizes poorly to new, unseen data. We could go back and tune other hyperparameters, such as increasing the minimum node size (default is 5), to fight overfitting. We do not do that here, however, since it is not the purpose of this tutorial.&lt;/p&gt;
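&lt;p&gt;As a sketch of such a tuning step, &lt;strong&gt;ordinalForest&lt;/strong&gt; can be called directly with a larger minimum node size; the argument names here follow the &lt;code&gt;ordfor&lt;/code&gt; help page and should be checked against your installed version:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# not run: a hedged sketch, not part of the results above
set.seed(1234)
forest_reg &amp;lt;- ordfor(depvar = &amp;quot;cp&amp;quot;, data = train,
                     nsets = 50, ntreeperdiv = 100, ntreefinal = 400,
                     min.node.size = 10)  # larger terminal nodes to reduce overfitting&lt;/code&gt;&lt;/pre&gt;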
&lt;/div&gt;
&lt;div id=&#34;continuation-ratio-model&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Continuation Ratio Model&lt;/h1&gt;
&lt;p&gt;This model uses vector generalized additive models, which are available in the &lt;strong&gt;VGAM&lt;/strong&gt; package. For more detail about these models, click &lt;a href=&#34;https://cran.r-project.org/web/packages/VGAM/vignettes/categoricalVGAM.pdf&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(VGAM)
set.seed(1234)
model_vgam&amp;lt;-train(cp~.-age-trestbps-chol, data=train,
                  method=&amp;quot;vglmContRatio&amp;quot;, trace=FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_vgam&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Continuation Ratio Model for Ordinal Data 

226 samples
  8 predictor
  3 classes: &amp;#39;no&amp;#39;, &amp;#39;mod&amp;#39;, &amp;#39;sev&amp;#39; 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 226, 226, 226, 226, 226, 226, ... 
Resampling results across tuning parameters:

  parallel  link     Accuracy   Kappa    
  FALSE     logit    0.5962581  0.3323075
  FALSE     probit   0.5942637  0.3302998
  FALSE     cloglog  0.5973844  0.3293056
  FALSE     cauchit  0.5967368  0.3316896
  FALSE     logc     0.5945121  0.3152759
   TRUE     logit    0.5758330  0.2961673
   TRUE     probit   0.5738297  0.2924747
   TRUE     cloglog  0.5838764  0.3014038
   TRUE     cauchit  0.5810184  0.3067004
   TRUE     logc     0.5302522  0.1031624

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were parallel = FALSE and link = cloglog.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The best model is obtained when the argument &lt;strong&gt;parallel&lt;/strong&gt; is FALSE and &lt;strong&gt;link&lt;/strong&gt; is &lt;strong&gt;cloglog&lt;/strong&gt;, the complementary log-log function.&lt;/p&gt;
&lt;p&gt;The accuracy rate of the training data is:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model_vgam, train) %&amp;gt;% 
  bind_cols(train) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.659&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And the accuracy of the testing set is:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model_vgam, test) %&amp;gt;% 
  bind_cols(test) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.630&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is one of the better test accuracy rates so far, though still below that of the logistic model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;compare-models&#34; class=&#34;section level1&#34; number=&#34;8&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;8&lt;/span&gt; Compare models&lt;/h1&gt;
&lt;p&gt;We can compare the above models using caret’s &lt;strong&gt;resamples&lt;/strong&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;models_eval&amp;lt;-resamples(list(logit=model_logistic,
                            cart=model_cart,
                            forest=model_forest,
                            vgam=model_vgam))
summary(models_eval)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
Call:
summary.resamples(object = models_eval)

Models: logit, cart, forest, vgam 
Number of resamples: 25 

Accuracy 
            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA&amp;#39;s
logit  0.5060241 0.5731707 0.5822785 0.5871083 0.6097561 0.6627907    0
cart   0.3734940 0.5824176 0.6097561 0.6004116 0.6279070 0.6746988    0
forest 0.4891304 0.5609756 0.5853659 0.5835550 0.6162791 0.6385542    0
vgam   0.4936709 0.5760870 0.6046512 0.5973844 0.6202532 0.6626506    0

Kappa 
               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA&amp;#39;s
logit   0.189086980 0.2792369 0.3144822 0.3100458 0.3437500 0.4512651    0
cart   -0.004889406 0.3185420 0.3474144 0.3343997 0.3775576 0.4526136    0
forest  0.186912373 0.2719432 0.3091678 0.3046105 0.3464604 0.4011544    0
vgam    0.144558744 0.2993406 0.3367647 0.3293056 0.3690791 0.4142980    0&lt;/code&gt;&lt;/pre&gt;
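&lt;p&gt;caret also provides lattice plots for a &lt;strong&gt;resamples&lt;/strong&gt; object, which make this comparison easier to read:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bwplot(models_eval, metric = &amp;quot;Accuracy&amp;quot;)   # box plots of resampled accuracy
dotplot(models_eval, metric = &amp;quot;Accuracy&amp;quot;)  # means with confidence intervals&lt;/code&gt;&lt;/pre&gt;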
&lt;p&gt;Based on the training set and using the mean accuracy rate, the &lt;strong&gt;cart&lt;/strong&gt; model is the best model for this data, with a mean of 60.04% across resamples. However, things are different when we use the testing set instead.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tibble(models=c(&amp;quot;logit&amp;quot;, &amp;quot;cart&amp;quot;, &amp;quot;forest&amp;quot;, &amp;quot;vgam&amp;quot;), 
       accuracy=c(
  predict(model_logistic, test) %&amp;gt;% 
  bind_cols(test) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth) %&amp;gt;% 
  .[, &amp;quot;.estimate&amp;quot;],
  predict(model_cart, test) %&amp;gt;% 
  bind_cols(test) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth) %&amp;gt;% 
  .[, &amp;quot;.estimate&amp;quot;],
  predict(model_forest, test) %&amp;gt;% 
  bind_cols(test) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth) %&amp;gt;% 
  .[, &amp;quot;.estimate&amp;quot;],
  predict(model_vgam, test) %&amp;gt;% 
  bind_cols(test) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth) %&amp;gt;% 
  .[, &amp;quot;.estimate&amp;quot;])) %&amp;gt;% 
  unnest()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 4 x 2
  models accuracy
  &amp;lt;chr&amp;gt;     &amp;lt;dbl&amp;gt;
1 logit     0.648
2 cart      0.630
3 forest    0.574
4 vgam      0.630&lt;/code&gt;&lt;/pre&gt;
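&lt;p&gt;The repeated pipeline above can be condensed by mapping one small helper over a named list of models (same result, less duplication):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test_accuracy &amp;lt;- function(model) {
  predict(model, test) %&amp;gt;%
    bind_cols(test) %&amp;gt;%
    rename(pred = &amp;quot;...1&amp;quot;, truth = cp) %&amp;gt;%
    accuracy(pred, truth) %&amp;gt;%
    pull(.estimate)
}
list(logit = model_logistic, cart = model_cart,
     forest = model_forest, vgam = model_vgam) %&amp;gt;%
  map_dbl(test_accuracy) %&amp;gt;%
  enframe(name = &amp;quot;models&amp;quot;, value = &amp;quot;accuracy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;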
&lt;p&gt;Using the testing set, the logistic model with the link &lt;strong&gt;logit&lt;/strong&gt; is the best model to predict this data.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;9&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;9&lt;/span&gt; Conclusion&lt;/h1&gt;
&lt;p&gt;We have seen so far how to model ordinal data by exploring several models, and it turned out that the logistic model is the best one for our data. In general, however, the best model depends strongly on the data at hand.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;session-information&#34; class=&#34;section level1&#34; number=&#34;10&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;10&lt;/span&gt; Session information&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] splines   stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] VGAM_1.1-3          e1071_1.7-3         ranger_0.12.1      
 [4] ordinalForest_2.4-1 rpartScore_1.0-1    rpart_4.1-15       
 [7] MASS_7.3-53         yardstick_0.0.7     workflows_0.2.0    
[10] tune_0.1.1          rsample_0.0.8       recipes_0.1.13     
[13] parsnip_0.1.3       modeldata_0.0.2     infer_0.5.3        
[16] dials_0.0.9         scales_1.1.1        broom_0.7.1        
[19] tidymodels_0.1.1    caret_6.0-86        lattice_0.20-41    
[22] forcats_0.5.0       stringr_1.4.0       dplyr_1.0.2        
[25] purrr_0.3.4         readr_1.3.1         tidyr_1.1.2        
[28] tibble_3.0.3        ggplot2_3.3.2       tidyverse_1.3.0    

loaded via a namespace (and not attached):
 [1] colorspace_1.4-1     ellipsis_0.3.1       class_7.3-17        
 [4] base64enc_0.1-3      fs_1.5.0             rstudioapi_0.11     
 [7] listenv_0.8.0        furrr_0.1.0          prodlim_2019.11.13  
[10] fansi_0.4.1          lubridate_1.7.9      xml2_1.3.2          
[13] codetools_0.2-16     knitr_1.30           jsonlite_1.7.1      
[16] pROC_1.16.2          dbplyr_1.4.4         compiler_4.0.1      
[19] httr_1.4.2           backports_1.1.10     assertthat_0.2.1    
[22] Matrix_1.2-18        cli_2.0.2            htmltools_0.5.0     
[25] tools_4.0.1          gtable_0.3.0         glue_1.4.2          
[28] reshape2_1.4.4       Rcpp_1.0.5           cellranger_1.1.0    
[31] DiceDesign_1.8-1     vctrs_0.3.4          nlme_3.1-149        
[34] blogdown_0.20        iterators_1.0.12     timeDate_3043.102   
[37] gower_0.2.2          xfun_0.18            globals_0.13.0      
[40] rvest_0.3.6          lifecycle_0.2.0      future_1.19.1       
[43] ipred_0.9-9          hms_0.5.3            parallel_4.0.1      
[46] yaml_2.2.1           stringi_1.5.3        highr_0.8           
[49] foreach_1.5.0        lhs_1.1.0            lava_1.6.8          
[52] repr_1.1.0           rlang_0.4.7          pkgconfig_2.0.3     
[55] evaluate_0.14        tidyselect_1.1.0     plyr_1.8.6          
[58] magrittr_1.5         bookdown_0.20        R6_2.4.1            
[61] generics_0.0.2       DBI_1.1.0            pillar_1.4.6        
[64] haven_2.3.1          withr_2.3.0          survival_3.2-7      
[67] nnet_7.3-14          modelr_0.1.8         crayon_1.3.4        
[70] utf8_1.1.4           rmarkdown_2.4        grid_4.0.1          
[73] readxl_1.3.1         data.table_1.13.0    blob_1.2.1          
[76] ModelMetrics_1.2.2.2 reprex_0.3.0         digest_0.6.25       
[79] munsell_0.5.0        GPfit_1.0-8          skimr_2.1.2         &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Predicting binary response variable with h2o framework</title>
      <link>https://modelingwithr.rbind.io/sparklyr/h2o/predicting-binary-response-variable-with-h2o-framework/</link>
      <pubDate>Wed, 03 Jun 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/sparklyr/h2o/predicting-binary-response-variable-with-h2o-framework/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#logistic-regression&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Logistic regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#random-forest&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Random forest&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#random-forest-with-binomial-double-trees&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.1&lt;/span&gt; Random forest with binomial double trees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#random-forest-tuning&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.2&lt;/span&gt; Random forest tuning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#deep-learning-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Deep learning model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Conclusion:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#further-reading&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Further reading&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-information&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8&lt;/span&gt; Session information&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;H2O is an open-source distributed scalable framework used to train machine learning and deep learning models as well as data analysis. It can handle large data sets, with ease of use, by creating a cluster from the available nodes. Fortunately, it provides an API for R users to get the most benefits from it, especially when it comes to large data sets, with which R has its most limitations.&lt;/p&gt;
&lt;p&gt;The beauty is that R users can load and use this system via the &lt;strong&gt;h2o&lt;/strong&gt; package, which can be called and used like any other R package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# install.packages(&amp;quot;h2o&amp;quot;) if not already installed
library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;-- Attaching packages -------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;v ggplot2 3.3.2     v purrr   0.3.4
v tibble  3.0.3     v dplyr   1.0.2
v tidyr   1.1.2     v stringr 1.4.0
v readr   1.3.1     v forcats 0.5.0&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;ggplot2&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;tibble&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;tidyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;dplyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;-- Conflicts ----------------------
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(h2o)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
----------------------------------------------------------------------

Your next step is to start H2O:
    &amp;gt; h2o.init()

For H2O package documentation, ask for help:
    &amp;gt; ??h2o

After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit https://docs.h2o.ai

----------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
Attaching package: &amp;#39;h2o&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;The following objects are masked from &amp;#39;package:stats&amp;#39;:

    cor, sd, var&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;The following objects are masked from &amp;#39;package:base&amp;#39;:

    %*%, %in%, &amp;amp;&amp;amp;, ||, apply, as.factor, as.numeric, colnames,
    colnames&amp;lt;-, ifelse, is.character, is.factor, is.numeric, log,
    log10, log1p, log2, round, signif, trunc&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, to launch the cluster, run the following script:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.init(nthreads = -1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    C:\Users\dell\AppData\Local\Temp\RtmpGuHBDV\file2e5438ee3e8c/h2o_dell_started_from_r.out
    C:\Users\dell\AppData\Local\Temp\RtmpGuHBDV\file2e54103214ed/h2o_dell_started_from_r.err


Starting H2O JVM and connecting: . Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         6 seconds 974 milliseconds 
    H2O cluster timezone:       Europe/Paris 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.30.1.3 
    H2O cluster version age:    13 days  
    H2O cluster name:           H2O_started_from_R_dell_sgv874 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.99 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
    R Version:                  R version 4.0.1 (2020-06-06) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looking at this output, we see that h2o runs on the Java virtual machine (JVM), so you need Java already installed. Notice that I set the &lt;strong&gt;nthreads&lt;/strong&gt; argument to -1 to tell h2o to create its cluster using all the available cores.&lt;/p&gt;
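&lt;p&gt;As a side note, you can check at any time whether R is still connected to a running cluster, and shut the cluster down once you are done. The sketch below uses the &lt;strong&gt;h2o.clusterIsUp&lt;/strong&gt; and &lt;strong&gt;h2o.shutdown&lt;/strong&gt; helpers from the package; the shutdown call is commented out since you should only run it at the end of your session.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# TRUE if R is still connected to a running cluster
h2o.clusterIsUp()

# stop the cluster when the analysis is finished
# h2o.shutdown(prompt = FALSE)&lt;/code&gt;&lt;/pre&gt;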
&lt;p&gt;Since our purpose is to understand how to work with h2o, we will use a small data set in which the response is a binary variable. The data we will use is &lt;strong&gt;creditcard&lt;/strong&gt;, downloaded from the &lt;a href=&#34;https://www.kaggle.com/mlg-ulb/creditcardfraud&#34;&gt;kaggle&lt;/a&gt; website.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;To import the data directly into the h2o cluster, we use the function &lt;strong&gt;h2o.importFile&lt;/strong&gt; as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;card &amp;lt;- h2o.importFile(&amp;quot;../creditcard.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |======================================================================| 100%&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following script gives the dimensions of this data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.dim(card)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 284807     31&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This data has 284807 rows and 31 columns. According to the description of this data, the response variable is &lt;strong&gt;class&lt;/strong&gt;, with two values: 1 for a &lt;strong&gt;fraudulent card&lt;/strong&gt; and 0 for a &lt;strong&gt;regular card&lt;/strong&gt;. The other variables are PCA components derived from the original ones for privacy purposes, for instance to protect the users’ identities.&lt;br /&gt;
So first let’s check the summary of this data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knitr::kable(h2o.describe(card))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Label&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Type&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Missing&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Zeros&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PosInf&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;NegInf&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Min&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Max&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Mean&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Sigma&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Cardinality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Time&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;int&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.727920e+05&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.481386e+04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.748815e+04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-56.407510&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.454930e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.958696e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-72.715728&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.205773e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.651309e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-48.325589&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.382558e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.516255e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-5.683171&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.687534e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.415869e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V5&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-113.743307&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.480167e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.380247e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V6&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-26.160506&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.330163e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.332271e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V7&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-43.557242&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.205895e+02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.237094e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V8&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-73.216718&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.000721e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.194353e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V9&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-13.434066&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.559499e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.098632e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V10&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-24.588262&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.374514e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.088850e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V11&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-4.797473&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.201891e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.020713e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V12&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-18.683715&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.848392e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.992014e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-5.791881&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.126883e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.952742e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V14&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-19.214326&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.052677e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.585956e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V15&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-4.498945&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.877742e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.153160e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V16&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-14.129855&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.731511e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.762529e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V17&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-25.162799&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.253526e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.493371e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V18&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-9.498746&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.041069e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.381762e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V19&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-7.213527&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.591971e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.140405e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V20&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-54.497720&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.942090e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.709250e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V21&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-34.830382&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.720284e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.345240e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V22&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-10.933144&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.050309e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.257016e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V23&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-44.807735&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.252841e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.244603e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V24&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-2.836627&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.584549e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.056471e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V25&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-10.295397&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.519589e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.212781e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V26&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-2.604551&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.517346e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.822270e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V27&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-22.565679&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.161220e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.036325e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V28&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-15.430084&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.384781e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.300833e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Amount&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1825&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.569116e+04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.834962e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.501201e+02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Class&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;int&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;284315&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.727500e-03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.152720e-02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The most important issues to check first are missing values and, for classification problems, class imbalance.&lt;/p&gt;
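&lt;p&gt;For the imbalance problem, a quick way to get the class distribution is to tabulate the response with &lt;strong&gt;h2o.table&lt;/strong&gt;, as in the sketch below; from the &lt;strong&gt;Class&lt;/strong&gt; row of the summary above, with 284315 zeros out of 284807 rows, we already know the 0 class dominates heavily.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# frequency of each level of the response variable
h2o.table(card[&amp;quot;Class&amp;quot;])&lt;/code&gt;&lt;/pre&gt;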
&lt;p&gt;For the missing values, you should know that R only recognizes a value as missing if it is written as &lt;strong&gt;NA&lt;/strong&gt; or left as a blank cell. If missing values in imported data are written in any other format, for instance as strings like &lt;strong&gt;na&lt;/strong&gt; or &lt;strong&gt;missing&lt;/strong&gt;, we must tell R to convert them to &lt;strong&gt;NA&lt;/strong&gt;. The same applies when, as in our case, a variable takes a zero value it should never take. The &lt;strong&gt;Amount&lt;/strong&gt; variable, for instance, should never equal zero, since every transaction involves some amount of money, yet in the data it has 1825 zeros; the same applies to the &lt;strong&gt;Time&lt;/strong&gt; variable with its two zeros. However, since the data is large, this is not a big issue, and we can comfortably remove these rows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;card$Amount &amp;lt;- h2o.ifelse(card$Amount == 0, NA, card$Amount)
card$Time &amp;lt;- h2o.ifelse(card$Time == 0, NA, card$Time)
card &amp;lt;- h2o.na_omit(card)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It is good practice to check your output after each transformation to make sure your code did what you expected.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knitr::kable(h2o.describe(card))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Label&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Type&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Missing&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Zeros&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PosInf&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;NegInf&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Min&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Max&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Mean&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Sigma&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Cardinality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Time&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;int&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.000000&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.727920e+05&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;94849.6338858&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.748196e+04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-56.407510&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.454930e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0003483&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.956753e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-72.715728&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.205773e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0020179&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.650496e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-48.325589&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.382558e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0033027&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.514214e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-5.683171&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.687534e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0119933&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.404852e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V5&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-113.743307&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.480167e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0022396&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.378819e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V6&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-26.160506&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.330163e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0013051&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.331596e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V7&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-43.557242&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.205895e+02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0025090&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.233944e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V8&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-73.216718&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.000721e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0000269&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.191177e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V9&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-13.320155&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.559499e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0014642&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.099065e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V10&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-24.588262&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.374514e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0022783&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.087587e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V11&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-4.797473&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.201891e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0023114&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.018693e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V12&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-18.683715&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.848392e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0008656&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.972279e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-5.791881&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.126883e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0006992&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.945502e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V14&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-19.214326&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.052677e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0002020&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.555395e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V15&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-4.498945&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.877742e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0036456&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.137113e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V16&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-14.129855&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.731511e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0010958&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.760560e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V17&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-25.162799&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.253526e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0016190&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.462568e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V18&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-9.498746&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.041069e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0013067&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.386969e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V19&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-7.213527&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.591971e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0019147&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.119902e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V20&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-54.497720&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.942090e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0009807&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.705625e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V21&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-34.830382&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.720284e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0000481&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.326525e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V22&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-10.933144&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.050309e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0016073&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.255767e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V23&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-44.807735&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.252841e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0001474&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.230342e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V24&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-2.836627&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.584549e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0002018&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.057968e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V25&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-10.295397&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.519589e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0005087&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.209869e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V26&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-2.604551&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.517346e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0013648&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.819297e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V27&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-22.565679&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.161220e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0002533&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.029874e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V28&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-15.430084&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.384781e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0001927&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.303524e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Amount&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.010000&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.569116e+04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;88.9194915&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.508252e+02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Class&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;int&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;282515&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0016432&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.050350e-02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;However, we have a very serious class imbalance problem: the &lt;strong&gt;Class&lt;/strong&gt; variable, which takes only the two values 0 and 1, has a mean of about 0.0016, which means that the overwhelming majority of observations carry the label 0.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.table(card$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;  Class  Count
1     0 282515
2     1    465

[2 rows x 2 columns] &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As expected, the majority of cases are of class label 0. Any machine learning model fitted to this data without correcting this problem will be dominated by label 0 and will rarely predict the fraudulent cards (label 1) correctly, which are our main interest.&lt;/p&gt;
&lt;p&gt;The h2o package provides a way to correct the imbalance problem. For &lt;strong&gt;glm&lt;/strong&gt; models, for instance, we have three arguments for this purpose:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;balance_classes: if set to TRUE, the classes are balanced by resampling, either with default ratios or with the ratios specified in the next argument.&lt;/li&gt;
&lt;li&gt;class_sampling_factors: The desired sampling ratios per class (over or under-sampling).&lt;/li&gt;
&lt;li&gt;max_after_balance_size: The desired relative size of the training data after balancing class counts.&lt;/li&gt;
&lt;/ul&gt;
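&lt;p&gt;To illustrate how these three arguments fit together, here is a hypothetical sketch of a balanced &lt;strong&gt;glm&lt;/strong&gt; call (the sampling factors are arbitrary illustration values, not recommendations):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# hypothetical sketch: oversample the minority class (label 1) ten times
model_balanced &amp;lt;- h2o.glm(
  x = 1:30,
  y = 31,
  training_frame = card,
  family = &amp;quot;binomial&amp;quot;,
  balance_classes = TRUE,
  class_sampling_factors = c(1, 10),
  max_after_balance_size = 2
)&lt;/code&gt;&lt;/pre&gt;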
&lt;p&gt;Before going ahead, we should randomly split the data into a training set (80% of the data) and a testing set (the remaining 20%).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;card$Class &amp;lt;- h2o.asfactor(card$Class)

parts &amp;lt;- h2o.splitFrame(card, 0.8, seed = 1111)
train &amp;lt;- parts[[1]]
test &amp;lt;- parts[[2]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.table(train$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;  Class  Count
1     0 226268
2     1    372

[2 rows x 2 columns] &lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.table(test$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;  Class Count
1     0 56247
2     1    93

[2 rows x 2 columns] &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;logistic-regression&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Logistic regression&lt;/h1&gt;
&lt;p&gt;For binary classification problems, the first model that comes to mind is the logistic regression model. It belongs to the family of &lt;strong&gt;glm&lt;/strong&gt; models: setting the family argument to &lt;strong&gt;binomial&lt;/strong&gt; gives a logistic regression model. The following are the main arguments of &lt;strong&gt;glm&lt;/strong&gt; models (besides the arguments discussed above):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;x: the predictor names (not the data itself) or their column indices.&lt;/li&gt;
&lt;li&gt;y: the name of the response variable (again, not the whole column).&lt;/li&gt;
&lt;li&gt;training_frame: the training data frame.&lt;/li&gt;
&lt;li&gt;model_id: a name for the model.&lt;/li&gt;
&lt;li&gt;nfolds: the number of folds to use for cross-validation when tuning hyperparameters.&lt;/li&gt;
&lt;li&gt;seed: for reproducibility.&lt;/li&gt;
&lt;li&gt;fold_assignment: the scheme of the cross-validation: AUTO, Random, Stratified, or Modulo.&lt;/li&gt;
&lt;li&gt;family: many distributions are provided; for binary responses we have &lt;strong&gt;binomial&lt;/strong&gt; and &lt;strong&gt;quasibinomial&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;solver: the optimization algorithm. With &lt;strong&gt;AUTO&lt;/strong&gt;, h2o decides the best one given the data, but you can choose another one such as IRLSM, L_BFGS, or COORDINATE_DESCENT.&lt;/li&gt;
&lt;li&gt;alpha: the ratio that mixes the L1 (lasso) and L2 (ridge regression) regularization; larger values yield more lasso.&lt;/li&gt;
&lt;li&gt;lambda_search: lambda is the overall regularization strength. If TRUE, the model tries different lambda values.&lt;/li&gt;
&lt;li&gt;standardize: whether to standardize the numeric columns.&lt;/li&gt;
&lt;li&gt;compute_p_values: computes p-values for the coefficients; it does not work with regularization.&lt;/li&gt;
&lt;li&gt;link: the link function.&lt;/li&gt;
&lt;li&gt;interactions: the predictors to interact, if we want interactions between predictors.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now we are ready to train our model with some specified values. But first, let’s try to use the original data without correcting the imbalance problem.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_logit &amp;lt;- h2o.glm(
  x = 1:30,
  y = 31,
  training_frame = train,
  model_id = &amp;quot;glm_binomial_no_eg&amp;quot;,
  seed = 123,
  lambda = 0,
  family = &amp;quot;binomial&amp;quot;,
  solver = &amp;quot;IRLSM&amp;quot;,
  standardize = TRUE,
  link = &amp;quot;family_default&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |======================================================================| 100%&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;h2o provides a bunch of metrics already computed during the training process, along with the confusion matrix. We can access them by calling the function &lt;strong&gt;h2o.performance&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.performance(model_logit)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;H2OBinomialMetrics: glm
** Reported on training data. **

MSE:  0.0006269349
RMSE:  0.02503867
LogLoss:  0.003809522
Mean Per-Class Error:  0.1103587
AUC:  0.9731273
AUCPR:  0.7485898
Gini:  0.9462545
R^2:  0.6174137
Residual Deviance:  1726.78
AIC:  1788.78

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
            0   1    Error         Rate
0      226203  65 0.000287   =65/226268
1          82 290 0.220430      =82/372
Totals 226285 355 0.000649  =147/226640

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold         value idx
1                       max f1  0.135889      0.797799 176
2                       max f2  0.058169      0.808081 200
3                 max f0point5  0.475561      0.833333 102
4                 max accuracy  0.135889      0.999351 176
5                max precision  0.999976      0.909091   0
6                   max recall  0.000027      1.000000 397
7              max specificity  0.999976      0.999960   0
8             max absolute_mcc  0.135889      0.797693 176
9   max min_per_class_accuracy  0.001118      0.919355 345
10 max mean_per_class_accuracy  0.002782      0.934336 314
11                     max tns  0.999976 226259.000000   0
12                     max fns  0.999976    282.000000   0
13                     max fps  0.000007 226268.000000 399
14                     max tps  0.000027    372.000000 397
15                     max tnr  0.999976      0.999960   0
16                     max fnr  0.999976      0.758065   0
17                     max fpr  0.000007      1.000000 399
18                     max tpr  0.000027      1.000000 397

Gains/Lift Table: Extract with `h2o.gainsLift(&amp;lt;model&amp;gt;, &amp;lt;data&amp;gt;)` or `h2o.gainsLift(&amp;lt;model&amp;gt;, valid=&amp;lt;T/F&amp;gt;, xval=&amp;lt;T/F&amp;gt;)`&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To extract only the confusion matrix, we call the function &lt;strong&gt;h2o.confusionMatrix&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_logit)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.135888872638703:
            0   1    Error         Rate
0      226203  65 0.000287   =65/226268
1          82 290 0.220430      =82/372
Totals 226285 355 0.000649  =147/226640&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looking at the confusion matrix, we get a very low error rate for the major label (0.029%), whereas the error rate for the minor label is quite high (22.04%). This result is expected since the data are highly dominated by the label “0”.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_logit, test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.0767397449673996:
           0  1    Error       Rate
0      56223 24 0.000427  =24/56247
1         19 74 0.204301     =19/93
Totals 56242 98 0.000763  =43/56340&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On the testing set, the error rate of the major class (0.043%) is slightly larger than its training counterpart, whereas the error rate of the minor class (20.43%) is smaller than its training counterpart (22.04%).&lt;/p&gt;
&lt;p&gt;We can correct the imbalance problem by setting the argument &lt;strong&gt;balance_classes&lt;/strong&gt; to TRUE. Unfortunately, I trained this model many times, but this argument did not seem to work for some reason. I do not know whether this problem occurs with this version of h2o for everyone or only for me due to some issue with my laptop. I posted a question about it on &lt;strong&gt;stackoverflow&lt;/strong&gt; but had not yet received an answer at the time of writing.&lt;/p&gt;
&lt;p&gt;Alternatively, we can correct the imbalance problem by loading the data into R as a data frame, balancing it with the &lt;strong&gt;ROSE&lt;/strong&gt; package, and then converting the corrected data back to an h2o object.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Loading data from h2o into R will not always be possible for a very large dataset. I am using this alternative only to carry on with our analysis and not get stuck.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_R &amp;lt;- as.data.frame(train)
train_balance &amp;lt;- ROSE::ROSE(Class~., data=train_R, seed=111)$data
table(train_balance$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
     0      1 
113244 113396 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we feed this corrected data to our model again after converting it back to h2o.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_h &amp;lt;- as.h2o(train_balance)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_logit2 &amp;lt;- h2o.glm(
  x = 1:30,
  y = 31,
  training_frame = train_h,
  model_id = &amp;quot;glm_binomial_balance&amp;quot;,
  seed = 123,
  lambda = 0,
  family = &amp;quot;binomial&amp;quot;,
  solver = &amp;quot;IRLSM&amp;quot;,
  standardize = TRUE,
  link = &amp;quot;family_default&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |======================================================================| 100%&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can check the confusion matrix as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_logit2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.448958365921134:
            0      1    Error           Rate
0      110591   2653 0.023427   =2653/113244
1       12594 100802 0.111062  =12594/113396
Totals 123185 103455 0.067274  =15247/226640&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the reliable measure of model performance is on unseen data, let’s use our testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_logit2, test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.9289188100923:
           0   1    Error       Rate
0      56200  47 0.000836  =47/56247
1         16  77 0.172043     =16/93
Totals 56216 124 0.001118  =63/56340&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since we are more interested in the minor class, we will count it as an improvement if we get a lower error rate for that class. After correcting the class imbalance problem, the minor class error rate has dropped from 20.43% to 17.20%.&lt;/p&gt;
&lt;p&gt;One strategy to improve our model is to remove the less important variables by hand using a threshold. h2o provides a function that lists the predictors in decreasing order of their importance in predicting the response variable, so we can remove the least important variables in the hope of reducing the error rate of the minor class.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.varimp(model_logit)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;   variable relative_importance scaled_importance   percentage
1        V4         0.994893006       1.000000000 0.1440671995
2       V10         0.916557351       0.921262231 0.1327236697
3       V14         0.532427886       0.535160950 0.0770991394
4       V22         0.481085303       0.483554815 0.0696643880
5        V9         0.371758112       0.373666424 0.0538330752
6       V20         0.360902368       0.362754956 0.0522610906
7       V27         0.340929107       0.342679168 0.0493688281
8       V13         0.324390153       0.326055315 0.0469738762
9       V21         0.309050873       0.310637296 0.0447526453
10      V16         0.230524987       0.231708320 0.0333815688
11   Amount         0.213887758       0.214985688 0.0309723861
12       V8         0.211491780       0.212577411 0.0306254323
13     Time         0.209412404       0.210487362 0.0303243248
14       V6         0.189761276       0.190735361 0.0274787093
15       V5         0.176081346       0.176985209 0.0254977634
16       V1         0.164618852       0.165463875 0.0238379171
17      V12         0.134542774       0.135233410 0.0194826987
18      V11         0.129031560       0.129693906 0.0186846379
19      V28         0.093633317       0.094113956 0.0135587341
20      V26         0.093283287       0.093762130 0.0135080475
21      V17         0.081893193       0.082313568 0.0118586852
22       V7         0.077962451       0.078362649 0.0112894873
23      V23         0.067840817       0.068189058 0.0098238066
24      V18         0.065741510       0.066078975 0.0095198129
25      V25         0.033325258       0.033496323 0.0048257215
26       V2         0.029047974       0.029197083 0.0042063420
27      V24         0.025833162       0.025965769 0.0037408156
28      V19         0.022354254       0.022469003 0.0032370463
29      V15         0.020189854       0.020293493 0.0029236267
30       V3         0.003304571       0.003321534 0.0004785241&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or as a plot:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.varimp_plot(model_logit)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/sparklyr/h2o/2020-06-03-predicting-binary-response-variable-with-h2o-framework.en_files/figure-html/unnamed-chunk-21-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
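&lt;p&gt;As a sketch of this by-hand strategy, we could keep only the predictors whose scaled importance exceeds some threshold (the 5% cutoff below is an arbitrary choice for illustration) and refit the model on that subset:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;imp &amp;lt;- as.data.frame(h2o.varimp(model_logit))
keep &amp;lt;- imp$variable[imp$scaled_importance &amp;gt; 0.05]
model_reduced &amp;lt;- h2o.glm(
  x = keep,
  y = &amp;quot;Class&amp;quot;,
  training_frame = train_h,
  family = &amp;quot;binomial&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;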
&lt;p&gt;A better strategy for removing the less important variables is to use lasso regression (L1), which strips out the less important ones automatically; it is also known as a feature selection method. Lasso, like ridge regression (L2), is a regularization technique used to fight overfitting, and beyond that it acts as a reduction technique since it reduces the number of predictors. We enable this method in h2o by setting &lt;code&gt;alpha=1&lt;/code&gt;, where &lt;strong&gt;alpha&lt;/strong&gt; controls the trade-off between lasso (L1) and ridge regression (L2); alpha closer to zero means more ridge than lasso.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_lasso &amp;lt;- h2o.glm(
  x = 1:30,
  y = 31,
  training_frame = train_h,
  model_id = &amp;quot;glm_binomial_lasso&amp;quot;,
  seed = 123,
  alpha = 1,
  family = &amp;quot;binomial&amp;quot;,
  solver = &amp;quot;IRLSM&amp;quot;,
  standardize = TRUE,
  link = &amp;quot;family_default&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |======================================================================| 100%&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_lasso)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.429938956856689:
            0      1    Error           Rate
0      110315   2929 0.025865   =2929/113244
1       12339 101057 0.108813  =12339/113396
Totals 122654 103986 0.067367  =15268/226640&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using the testing set, the confusion matrix will be:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_lasso, test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.958116537926135:
           0   1    Error       Rate
0      56210  37 0.000658  =37/56247
1         20  73 0.215054     =20/93
Totals 56230 110 0.001012  =57/56340&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the lasso model, the error rate of the minor class on the testing set has increased from 17.20% to 21.50%, which contradicts the improvement recorded on the training data, where the rate decreased from 11.11% to 10.88% with the lasso model.&lt;/p&gt;
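&lt;p&gt;To see which predictors the lasso actually stripped out, we can inspect the fitted coefficients and list the ones shrunk exactly to zero (a quick sketch using &lt;code&gt;h2o.coef&lt;/code&gt;):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;coefs &amp;lt;- h2o.coef(model_lasso)
# predictors removed by the L1 penalty
names(coefs)[coefs == 0]&lt;/code&gt;&lt;/pre&gt;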
&lt;p&gt;One last thing about hyperparameter tuning: some hyperparameters are not supported by the &lt;strong&gt;h2o.grid&lt;/strong&gt; function, for instance the &lt;strong&gt;solver&lt;/strong&gt; argument. But this is not an issue, since we can simply loop over the hyperparameters in question ourselves. Let’s explore the most popular solvers using the R lapply function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;solvers &amp;lt;- c(
  &amp;quot;IRLSM&amp;quot;,
  &amp;quot;L_BFGS&amp;quot;,
  &amp;quot;COORDINATE_DESCENT&amp;quot;
)

mygrid &amp;lt;- lapply(solvers, function(solver) {
  h2o.glm(
    x = 1:30,
    y = 31,
    training_frame = train_h,
    seed = 123,
    model_id = paste0(&amp;quot;logit_&amp;quot;, solver),
    family = &amp;quot;binomial&amp;quot;,
    solver = solver,
    standardize = TRUE,
    link = &amp;quot;family_default&amp;quot;
  )
   
})&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df &amp;lt;- cbind(
  h2o.confusionMatrix(mygrid[[1]])$Error,
  h2o.confusionMatrix(mygrid[[2]])$Error,
  h2o.confusionMatrix(mygrid[[3]])$Error
)
df &amp;lt;- t(round(df, digits = 6))
dimnames(df) &amp;lt;- list(
  list(&amp;quot;IRLSM&amp;quot;, &amp;quot;L_BFGS&amp;quot;,  &amp;quot;COORDINATE_DESCENT&amp;quot;),
  list(&amp;quot;Error (0)&amp;quot;, &amp;quot;Error (1)&amp;quot;, &amp;quot;Total Error&amp;quot;)
  
)
df&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;                   Error (0) Error (1) Total Error
IRLSM               0.024169  0.110313    0.067270
L_BFGS              0.024354  0.110189    0.067301
COORDINATE_DESCENT  0.025909  0.109051    0.067508&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It seems there is no significant difference between these solvers. If we focus on the error of the minor class, however, &lt;strong&gt;COORDINATE_DESCENT&lt;/strong&gt; is the best one, with the lowest error. But this could be the result of random chance, since we did not use cross-validation.&lt;/p&gt;
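&lt;p&gt;To reduce the role of chance, we could rerun the same comparison with cross-validation. A minimal sketch, assuming the same h2o session and the train_h frame used above (the fold settings are illustrative, not part of the original run):&lt;/p&gt;

```r
# Rerun the solver comparison with 5-fold cross-validation, so the
# metrics are averaged over folds instead of coming from a single fit.
solvers <- c("IRLSM", "L_BFGS", "COORDINATE_DESCENT")

cv_models <- lapply(solvers, function(solver) {
  h2o.glm(
    x = 1:30,
    y = 31,
    training_frame = train_h,
    seed = 123,
    family = "binomial",
    solver = solver,
    standardize = TRUE,
    nfolds = 5,                     # 5-fold cross-validation
    fold_assignment = "Stratified"  # keep the class ratio in each fold
  )
})

# Compare the cross-validated logloss of the three solvers
sapply(cv_models, h2o.logloss, xval = TRUE)
```

The solver with the lowest cross-validated logloss would then be a more trustworthy pick than one chosen from a single training fit.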
&lt;/div&gt;
&lt;div id=&#34;random-forest&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Random forest&lt;/h1&gt;
&lt;p&gt;The random forest model is one of the most popular machine learning models due to its capability to capture even complex patterns in the data. At the same time, however, this capability can be a downside, since the model tends to memorize everything in the data, including the noise, which gives rise to the overfitting problem. That is why this model has a large number of hyperparameters, for regularization techniques among others, to control the training process.
The main hyperparameters provided by h2o are the following&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;seed: for reproducibility.&lt;/li&gt;
&lt;li&gt;ntrees: The number of trees used (also called iterations). The default is 50.&lt;/li&gt;
&lt;li&gt;max_depth: The maximum depth allowed for each tree. The default is 20.&lt;/li&gt;
&lt;li&gt;mtries: The number of columns chosen randomly for each tree. The default is &lt;span class=&#34;math inline&#34;&gt;\(\sqrt{p}\)&lt;/span&gt; for classification, and &lt;span class=&#34;math inline&#34;&gt;\(\frac{p}{3}\)&lt;/span&gt; for regression (where p is the number of columns).&lt;/li&gt;
&lt;li&gt;sample_rate: The proportion of the training data selected randomly for each tree. The default is 63.2%.&lt;/li&gt;
&lt;li&gt;balance_classes: This is one of the most important hyperparameters for our data, since it is highly imbalanced. The default is false; if set to true, the model will correct this problem by making use of over/under-sampling methods.&lt;/li&gt;
&lt;li&gt;min_rows: The minimum number of instances in a node to allow splitting it. The default is 1.&lt;/li&gt;
&lt;li&gt;min_split_improvement: The minimum error reduction needed to make a further split. The default is 0.&lt;/li&gt;
&lt;li&gt;binomial_double_trees: For binary classification. If true, the model builds two random forests, one for each output class. This method can give higher accuracy at the cost of doubling the computation time.&lt;/li&gt;
&lt;li&gt;stopping_rounds: The number of iterations used for early stopping: the training process stops if the moving average of the stopping_metric (over this number of iterations) does not improve. The default is 0, which means early stopping is disabled.&lt;/li&gt;
&lt;li&gt;stopping_metric: Works with the previous argument. The default is AUTO, that is, the &lt;strong&gt;logloss&lt;/strong&gt; for classification and the &lt;strong&gt;deviance&lt;/strong&gt; for regression, but we also have &lt;strong&gt;MSE&lt;/strong&gt;, &lt;strong&gt;RMSE&lt;/strong&gt;, &lt;strong&gt;MAE&lt;/strong&gt;, &lt;strong&gt;AUC&lt;/strong&gt;, and &lt;strong&gt;misclassification&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;stopping_tolerance: The threshold under which we consider no improvement. The default is 0.001.&lt;/li&gt;
&lt;/ul&gt;
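&lt;p&gt;To make these arguments concrete, here is a sketch of a call that sets several of them at once; the values are purely illustrative, not tuned:&lt;/p&gt;

```r
# Illustrative values only: combine several of the hyperparameters above
model_sketch <- h2o.randomForest(
  x = 1:30,
  y = 31,
  training_frame = train,
  seed = 123,                  # reproducibility
  ntrees = 100,                # grow 100 trees instead of the default 50
  max_depth = 15,              # shallower trees than the default 20
  sample_rate = 0.5,           # each tree sees 50% of the rows
  min_rows = 10,               # at least 10 instances to split a node
  balance_classes = TRUE,      # over/under-sample the imbalanced classes
  stopping_rounds = 3,         # early stopping over a 3-iteration window
  stopping_metric = "logloss",
  stopping_tolerance = 0.001
)
```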
&lt;p&gt;First let’s try this model with the default values, except for balance_classes, which we set to true. Fortunately, unlike glm models, this argument works fine with the random forest model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_rf &amp;lt;- h2o.randomForest(
  x = 1:30,
  y = 31,
  training_frame = train,
  seed = 123,
  model_id = &amp;quot;rf_default&amp;quot;,
  balance_classes = TRUE
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we check how this model did with the training data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.performance(model_rf)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

MSE:  0.03503944
RMSE:  0.1871882
LogLoss:  0.1012676
Mean Per-Class Error:  6.629307e-06
AUC:  0.999995
AUCPR:  0.9999938
Gini:  0.9999901
R^2:  0.8598422

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
            0      1    Error       Rate
0      226265      3 0.000013  =3/226268
1           0 226262 0.000000  =0/226262
Totals 226265 226265 0.000007  =3/452530

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold         value idx
1                       max f1  0.060268      0.999993 397
2                       max f2  0.060268      0.999997 397
3                 max f0point5  0.060268      0.999989 397
4                 max accuracy  0.060268      0.999993 397
5                max precision  1.000000      1.000000   0
6                   max recall  0.060268      1.000000 397
7              max specificity  1.000000      1.000000   0
8             max absolute_mcc  0.060268      0.999987 397
9   max min_per_class_accuracy  0.060268      0.999987 397
10 max mean_per_class_accuracy  0.060268      0.999993 397
11                     max tns  1.000000 226268.000000   0
12                     max fns  1.000000 132282.000000   0
13                     max fps  0.000002 226268.000000 399
14                     max tps  0.060268 226262.000000 397
15                     max tnr  1.000000      1.000000   0
16                     max fnr  1.000000      0.584641   0
17                     max fpr  0.000002      1.000000 399
18                     max tpr  0.060268      1.000000 397

Gains/Lift Table: Extract with `h2o.gainsLift(&amp;lt;model&amp;gt;, &amp;lt;data&amp;gt;)` or `h2o.gainsLift(&amp;lt;model&amp;gt;, valid=&amp;lt;T/F&amp;gt;, xval=&amp;lt;T/F&amp;gt;)`&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Surprisingly, the model is almost perfect, with a 0.0007% overall error rate. This is very suspicious, since it suggests the model has memorized everything, even the noisy patterns. The real challenge for any model is how well it generalizes to unseen data, which is why we should always hold out some data as a testing set to assess the model performance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_rf, test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.00208813635434166:
           0  1    Error       Rate
0      56240  7 0.000124   =7/56247
1         16 77 0.172043     =16/93
Totals 56256 84 0.000408  =23/56340&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As expected, the model overfitted the data: the error rate of the minor class on the testing set is now very large, at 17.20%, the same as that obtained from the default logistic regression model.&lt;/p&gt;
&lt;div id=&#34;random-forest-with-binomial-double-trees&#34; class=&#34;section level2&#34; number=&#34;4.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.1&lt;/span&gt; Random forest with binomial double trees&lt;/h2&gt;
&lt;p&gt;Before going ahead with hyperparameter tuning, let’s try the binomial double trees technique discussed above.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_rf_dbl &amp;lt;- h2o.randomForest(
  x = 1:30,
  y = 31,
  training_frame = train,
  seed = 123,
  model_id = &amp;quot;rf_double&amp;quot;,
  binomial_double_trees = TRUE 
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_rf_dbl)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.473692341854698:
            0   1    Error        Rate
0      226258  10 0.000044  =10/226268
1          83 289 0.223118     =83/372
Totals 226341 299 0.000410  =93/226640&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_rf_dbl, test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.42:
           0  1    Error       Rate
0      56239  8 0.000142   =8/56247
1         13 80 0.139785     =13/93
Totals 56252 88 0.000373  =21/56340&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, this model is the best one so far, with the lowest error rate for the minor class at 13.98%.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;random-forest-tuning&#34; class=&#34;section level2&#34; number=&#34;4.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.2&lt;/span&gt; Random forest tuning&lt;/h2&gt;
&lt;p&gt;We can try to tune the hyperparameters related to the regularization techniques to fight the overfitting problem. For instance, we use lower values for max_depth and larger values for min_rows to prune the trees, and lower values for sample_rate to let each tree focus on a small part of the training data. We also set some values to stop the training process early if we do not obtain a significant improvement. Finally, to avoid the randomness of the results, we use cross-validation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#model_rftuned &amp;lt;- h2o.grid(
#  &amp;quot;randomForest&amp;quot;,
#  hyper_params = list(
#    max_depth = c(5, 10),
#    min_rows = c(10, 20, 30),
#    sample_rate = c(0.3, 0.5)
#  ),
# stopping_rounds = 5,
#  stopping_metric = &amp;quot;AUTO&amp;quot;,
#  stopping_tolerance = 0.001,
#  balance_classes = TRUE,
#  nfolds = 5,
#  fold_assignment = &amp;quot;Stratified&amp;quot;,
#  x = 1:30,
#  y = 31,
#  training_frame = train
#)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since this model took a lot of time to train, I saved its output in a CSV file and then loaded it back.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#df_output &amp;lt;- model_rftuned@summary_table %&amp;gt;% 
#  select(max_depth, min_rows, sample_rate, logloss) %&amp;gt;% 
#  arrange(logloss)
#write.csv(df_output, &amp;quot;df_output.csv&amp;quot;,  row.names = F)
df_output &amp;lt;- read.csv(&amp;quot;df_output.csv&amp;quot;)
knitr::kable(df_output)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;max_depth&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;min_rows&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;sample_rate&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;logloss&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;30&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0041177&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0041834&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;30&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0043959&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;30&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0044893&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0045269&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0045655&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0045780&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0046402&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0046463&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0046960&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;30&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0047160&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0047175&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Using the logloss metric, the best model is obtained with 10 for max_depth, 30 for min_rows, and 0.3 for sample_rate. Now let’s rerun the model with these values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_rf_best &amp;lt;- h2o.randomForest(
  x = 1:30,
  y = 31,
  training_frame = train,
  seed = 123,
  model_id = &amp;quot;rf_best&amp;quot;,
  max_depth = 10,
  min_rows = 30,
  sample_rate = 0.3,
  stopping_rounds = 5,
  stopping_metric = &amp;quot;AUTO&amp;quot;,
  stopping_tolerance = 0.001,
  balance_classes = TRUE
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_rf_best)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.00180477647421117:
            0      1    Error          Rate
0      225985    283 0.001251   =283/226268
1        2582 223680 0.011412  =2582/226262
Totals 228567 223963 0.006331  =2865/452530&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The model did well with the training data. But what about the testing set?&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_rf_best, test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.00496959295307637:
           0   1    Error       Rate
0      56226  21 0.000373  =21/56247
1         13  80 0.139785     =13/93
Totals 56239 101 0.000603  =34/56340&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this model, we get the same error rate for the minor class as with the binomial double trees model, but the binomial double trees model keeps the lower overall error rate.&lt;/p&gt;
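&lt;p&gt;To compare the two random forest variants side by side, we can bind their testing-set error columns into one table, as we did for the solvers earlier. A minimal sketch, assuming the model_rf_dbl and model_rf_best objects from above are still available:&lt;/p&gt;

```r
# Collect the testing-set errors of the two random forest variants
comparison <- rbind(
  double_trees = h2o.confusionMatrix(model_rf_dbl, test)$Error,
  tuned        = h2o.confusionMatrix(model_rf_best, test)$Error
)
colnames(comparison) <- c("Error (0)", "Error (1)", "Total Error")
round(comparison, digits = 6)
```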
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;deep-learning-model&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Deep learning model&lt;/h1&gt;
&lt;p&gt;Deep learning models are known for their high accuracy on very large and complex datasets. They have a large number of hyperparameters that can be tuned to efficiently handle a wide range of datasets. Tuning a large number of hyperparameters on large datasets, however, requires very large hardware resources and a lot of time, which is not always available or can be very costly (using cloud providers’ platforms). That is why this type of model requires considerable experience and practice to correctly set the right hyperparameter values.&lt;/p&gt;
&lt;p&gt;There are many frameworks for deep learning models. The most used ones are &lt;strong&gt;tensorflow&lt;/strong&gt; and &lt;strong&gt;keras&lt;/strong&gt;, since they are designed specifically for this type of model and can handle almost all the famous architectures, such as the &lt;strong&gt;feedforward neural network&lt;/strong&gt;, the &lt;strong&gt;convolutional neural network&lt;/strong&gt;, and the &lt;strong&gt;recurrent neural network&lt;/strong&gt;. Besides, they also provide tools to define our own architectures.&lt;/p&gt;
&lt;p&gt;h2o, for its part, provides only the &lt;strong&gt;feedforward neural network&lt;/strong&gt;, which consists of densely connected layers. This architecture, however, is the most widely used one in economics.
We can briefly discuss the main hyperparameters provided by h2o for this type of model (in addition to some of the hyperparameters above):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;hidden&lt;/strong&gt;: Specifies the number of hidden layers and the number of nodes in each layer. The default is 2 layers with 200 nodes each. Notice that the number of nodes in the input and output layers is determined automatically by h2o given the data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;autoencoder&lt;/strong&gt;: If true, we train an 
&lt;a href=&#34;https://towardsdatascience.com/autoencoder-neural-networks-what-and-how-354cba12bf86&#34;&gt;autoencoder&lt;/a&gt; model; otherwise the model uses supervised learning, which is the default.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;activation&lt;/strong&gt;: The activation function used. h2o provides three, each with or without dropout: Tanh, Rectifier, Maxout, TanhWithDropout, RectifierWithDropout, MaxoutWithDropout. The default is Rectifier.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;hidden_dropout_ratio&lt;/strong&gt;: A regularization technique that randomly drops a fraction of the node values from a hidden layer. The default is 0.5.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;missing_values_handling&lt;/strong&gt;: With two values, &lt;strong&gt;MeanImputation&lt;/strong&gt; and &lt;strong&gt;Skip&lt;/strong&gt;. The default is &lt;strong&gt;MeanImputation&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;input_dropout_ratio&lt;/strong&gt;: The same as the previous argument but for the input layer. The default is 0.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;L1 and L2&lt;/strong&gt;: For lasso and ridge regularization. The default is 0 for both.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;max_w2&lt;/strong&gt;: The upper limit on the sum of the squared weights incoming to each node. This can help fight the &lt;strong&gt;exploding gradient problem&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;train_samples_per_iteration&lt;/strong&gt;: The number of samples used before declaring one iteration. At the end of one iteration, the model is scored. The default is -2, which means h2o will decide given the data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;score_interval&lt;/strong&gt;: The alternative of the previous one, where the model will be scored after every 5 seconds with the default settings.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;score_duty_cycle&lt;/strong&gt;: It is another alternative to the two previous ones. It is the fraction of time spent in scoring, at the expense of that spent in training. The default is 0.1, which means 10% of the total time will be spent in scoring while the remaining 90% will be spent on training.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;target_ratio_comm_to_comp&lt;/strong&gt;: It is related to the cluster management. It controls the fraction of the communication time between nodes (The cluster nodes not the layer nodes). The default is 0.05, which means 5% of the total time will be spent on communication, and 95% in training inside each node.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;replicate_training_data&lt;/strong&gt;: The default is true, which means replicate the entire data on every cluster node.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;shuffle_training_data&lt;/strong&gt;: shuffle the inputs before feeding them into the network. It is recommended when we set balance_classes to true (like in our case). The default is false.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;score_validation_samples&lt;/strong&gt;: The number of samples from the validation set used in scoring. if we set this to 0 (which is the default) then the entire validation data will be used.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;score_training_samples&lt;/strong&gt;: The default is 10000, which the number of samples used from the training data to use in scoring. It is used when we do not have validation data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;score_validation_sampling&lt;/strong&gt;: It is used when we use only a fraction of the validation (when the &lt;strong&gt;score_validation_samples&lt;/strong&gt; has been specified with other values than the default of 0). The default is &lt;strong&gt;Uniform&lt;/strong&gt;, but for our case with imbalance classes we can use instead &lt;strong&gt;Stratified&lt;/strong&gt;, which is also provided as another value for this argument.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Since the two classes in our case are imbalanced, we set the &lt;strong&gt;balance_classes&lt;/strong&gt; argument to TRUE and leave all the other arguments at their default settings.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#model_deep &amp;lt;- h2o.deeplearning(
#  x = 1:30,
#  y = 31,
#  training_frame = train,
#  model_id = &amp;quot;deep_def&amp;quot;,
#  balance_classes = TRUE
#)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we did earlier, we save the model and then load it again to avoid rerunning it when rendering this document.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#h2o.saveModel(model_deep, 
#              path = #&amp;quot;C://Users/dell/Documents/new-blog/content/sparklyr/h2o&amp;quot;,
#              force = TRUE)
model_deep &amp;lt;- h2o.loadModel(&amp;quot;C://Users/dell/Documents/new-blog/content/sparklyr/h2o/deep_def&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_deep)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 9.38199481764512e-07:
          0    1    Error     Rate
0      4904    5 0.001019  =5/4909
1         0 5017 0.000000  =0/5017
Totals 4904 5022 0.000504  =5/9926&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Like the models above, this model predicts the training data almost perfectly.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_deep, test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.991909621262387:
           0  1    Error       Rate
0      56237 10 0.000178  =10/56247
1         17 76 0.182796     =17/93
Totals 56254 86 0.000479  =27/56340&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, this model does not predict the minority class very well. This result is expected since we used only the default values, so let’s try some custom hyperparameter values now.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: We will not tune the hyperparameters here because of the limited resources of my laptop.&lt;/p&gt;
&lt;p&gt;As a guideline, since the default deep learning model above fit the training data almost perfectly but generalized poorly to the unseen testing data, we should reduce the model’s complexity and add some regularization. So we will set the following values.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;hidden: we will use two hidden layers with 100 nodes each (instead of the default of 200 each).&lt;/li&gt;
&lt;li&gt;nfolds: we will use 5 folds to properly score the model on validation data (not the training data).&lt;/li&gt;
&lt;li&gt;fold_assignment: set to “Stratified” to make sure the minority class appears in every fold. This is crucially important with imbalanced classes.&lt;/li&gt;
&lt;li&gt;hidden_dropout_ratios: we set this to 0.2 for both layers.&lt;/li&gt;
&lt;li&gt;activation: with the previous argument we must provide the matching dropout activation function, &lt;strong&gt;RectifierWithDropout&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;L1: we set this argument to 0.0001.&lt;/li&gt;
&lt;li&gt;variable_importances: TRUE by default, so we set it to FALSE to reduce computation time, since our goal is prediction, not explanation.&lt;/li&gt;
&lt;li&gt;shuffle_training_data: since replicate_training_data is TRUE by default, we set this to TRUE (the default is FALSE) to shuffle the training data.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#model_deep_new &amp;lt;- h2o.deeplearning(
#  x = 1:30,
#  y = 31,
#  training_frame = train,
#  nfolds = 5,
#  fold_assignment = &amp;quot;Stratified&amp;quot;,
#  hidden = c(100,100),
#  model_id = &amp;quot;deep_new&amp;quot;,
#  standardize = TRUE,
#  balance_classes = TRUE,
#  hidden_dropout_ratios = c(0.2,0.2),
#  activation = &amp;quot;RectifierWithDropout&amp;quot;,
#  l1=1e-4,
#  variable_importances = FALSE,
#  shuffle_training_data = TRUE
#)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To prevent this model from being rerun when rendering our R Markdown document, we save the model and load it again for further use.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#h2o.saveModel(model_deep_new, 
#              path = #&amp;quot;C://Users/dell/Documents/new-blog/content/sparklyr/h2o&amp;quot;,
#              force = TRUE)
model_deep_new &amp;lt;- h2o.loadModel(&amp;quot;C://Users/dell/Documents/new-blog/content/sparklyr/h2o/deep_new&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_deep_new)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 1.69163968745202e-06:
          0    1    Error        Rate
0      4908   77 0.015446    =77/4985
1        51 5022 0.010053    =51/5073
Totals 4959 5099 0.012726  =128/10058&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As expected, this model is less accurate on the training data than the default one because it is less flexible. In other words, it has a larger bias, but we hope it also has a lower variance, which can be verified using the testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_deep_new, test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.250769269598065:
           0  1    Error       Rate
0      56228 19 0.000338  =19/56247
1         15 78 0.161290     =15/93
Totals 56243 97 0.000603  =34/56340&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With these new settings, we obtained a large improvement in the error rate of the minority class, about 16% (compared to about 18% for the default model). This rate is still larger than that of the best random forest model (13.97%). If you have enough time, you can improve the model further by applying a grid search over some hyperparameters.&lt;/p&gt;
&lt;p&gt;Finally, when you finish your work, do not forget to shut down h2o to free your resources as follows:&lt;/p&gt;
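For instance, such a grid search could be sketched with h2o.grid as below. The block is commented out, like the other expensive calls in this post, because it needs a running h2o cluster and the train frame from above; the grid candidates (hidden sizes, dropout ratios, l1 values) and the grid id are illustrative guesses, not recommendations.

```r
# Not run: requires a live h2o cluster and the train frame defined above.
# The candidate values below are illustrative, not tuned recommendations.
#grid_deep <- h2o.grid(
#  algorithm = "deeplearning",
#  grid_id = "deep_grid",
#  x = 1:30,
#  y = 31,
#  training_frame = train,
#  activation = "RectifierWithDropout",
#  balance_classes = TRUE,
#  hyper_params = list(
#    hidden = list(c(50, 50), c(100, 100)),
#    hidden_dropout_ratios = list(c(0.1, 0.1), c(0.2, 0.2)),
#    l1 = c(0, 1e-4)
#  )
#)
# Rank the trained models by AUC and inspect the best one.
#h2o.getGrid("deep_grid", sort_by = "auc", decreasing = TRUE)
```

With 2 hidden configurations, 2 dropout ratios, and 2 l1 values, this grid trains 8 models, so keep the grid small on limited hardware.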
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.shutdown()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Are you sure you want to shutdown the H2O instance running at http://localhost:54321/ (Y/N)? &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Conclusion:&lt;/h1&gt;
&lt;p&gt;Perhaps the most important lesson from this article is how much the hyperparameter values matter for model performance. The performance gap between models of the same type (with different hyperparameter values) can be larger than the gap between different types of models. In other words, if you do not have much time, spend it fine-tuning the hyperparameters of one model rather than trying different types of models. In practice, for large and complex datasets, the most powerful models are, in order: deep learning, XGBoost, and random forest.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;further-reading&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Further reading&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Darren Cook, Practical Machine Learning with H2O, O’Reilly, 2017.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.h2o.ai&#34; class=&#34;uri&#34;&gt;https://docs.h2o.ai&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;session-information&#34; class=&#34;section level1&#34; number=&#34;8&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;8&lt;/span&gt; Session information&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] h2o_3.30.1.3    forcats_0.5.0   stringr_1.4.0   dplyr_1.0.2    
 [5] purrr_0.3.4     readr_1.3.1     tidyr_1.1.2     tibble_3.0.3   
 [9] ggplot2_3.3.2   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0  xfun_0.18         haven_2.3.1       colorspace_1.4-1 
 [5] vctrs_0.3.4       generics_0.0.2    htmltools_0.5.0   yaml_2.2.1       
 [9] blob_1.2.1        rlang_0.4.7       pillar_1.4.6      glue_1.4.2       
[13] withr_2.3.0       DBI_1.1.0         bit64_4.0.5       dbplyr_1.4.4     
[17] modelr_0.1.8      readxl_1.3.1      lifecycle_0.2.0   munsell_0.5.0    
[21] blogdown_0.20     gtable_0.3.0      cellranger_1.1.0  rvest_0.3.6      
[25] codetools_0.2-16  evaluate_0.14     knitr_1.30        fansi_0.4.1      
[29] highr_0.8         broom_0.7.1       Rcpp_1.0.5        scales_1.1.1     
[33] backports_1.1.10  jsonlite_1.7.1    bit_4.0.4         fs_1.5.0         
[37] hms_0.5.3         digest_0.6.25     stringi_1.5.3     bookdown_0.20    
[41] grid_4.0.1        bitops_1.0-6      cli_2.0.2         tools_4.0.1      
[45] ROSE_0.0-3        magrittr_1.5      RCurl_1.98-1.2    crayon_1.3.4     
[49] pkgconfig_2.0.3   ellipsis_0.3.1    data.table_1.13.0 xml2_1.3.2       
[53] reprex_0.3.0      lubridate_1.7.9   assertthat_0.2.1  rmarkdown_2.4    
[57] httr_1.4.2        rstudioapi_0.11   R6_2.4.1          compiler_4.0.1   &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;Darren Cook, Practical Machine Learning with H2O, O’Reilly, 2017, p. 115&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Bayesian hyperparameters optimization</title>
      <link>https://modelingwithr.rbind.io/bayes/hyper_bayes/bayesian-hyperparameters-method/</link>
      <pubDate>Wed, 13 May 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/bayes/hyper_bayes/bayesian-hyperparameters-method/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bayesian-optimization&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Bayesian optimization&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#acquisition-functions&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.1&lt;/span&gt; Acquisition functions&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#upper-confidence-bound&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.1.1&lt;/span&gt; Upper confidence bound&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#probability-of-improvement-pi&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.1.2&lt;/span&gt; Probability of improvement PI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#expected-improvement-ei&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.1.3&lt;/span&gt; Expected improvement EI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#random-forest-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Random forest model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#the-true-distribution-of-the-hyperparameters&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.1&lt;/span&gt; The true distribution of the hyperparameters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#random-search&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.2&lt;/span&gt; random search&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bayesian-optimization-ucb&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.3&lt;/span&gt; bayesian optimization UCB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bayesian-optimization-pi&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.4&lt;/span&gt; bayesian optimization PI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bayesian-optimization-ei&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.5&lt;/span&gt; bayesian optimization EI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#contrast-the-results&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.6&lt;/span&gt; Contrast the results&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#deep-learning-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; deep learning model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#random-search-1&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5.1&lt;/span&gt; Random search&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bayesian-optimization-ucb-1&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5.2&lt;/span&gt; Bayesian optimization UCB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bayesian-optimization-pi-1&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5.3&lt;/span&gt; Bayesian optimization PI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bayesian-optimization-ei-1&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5.4&lt;/span&gt; Bayesian optimization EI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#contrast-the-results-1&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5.5&lt;/span&gt; Contrast the results&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-info&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Session info&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;Machine learning models are so called because of their ability to learn the parameter values that come as close as possible to those of the optimum objective function (or loss function). However, since all models require some assumptions (like linearity in linear regression models), parameters (like the cost C in svm models), and settings (like the number of layers in deep learning models) to be prespecified before training the model (in most cases they are set by default), the name &lt;strong&gt;machine learning&lt;/strong&gt; is not fully justified. These prespecified parameters are called &lt;strong&gt;hyperparameters&lt;/strong&gt;, and they should be defined in such a way that the corresponding model reaches its best performance (conditionally on the data at hand).&lt;/p&gt;
&lt;p&gt;The search for the best hyperparameters is called &lt;strong&gt;tuning&lt;/strong&gt;, which simply means training the model with many combinations of hyperparameter values, using techniques like cross-validation to make sure that the resulting loss very likely depends on that specific combination of values (and not on random chance), and then picking the combination that gives the optimum value of the objective function. This means that if our model requires a long computational time (due to limited hardware resources or a large dataset), we cannot try a large number of combinations, and hence we will likely end up far from the best result.&lt;/p&gt;
&lt;p&gt;The main problem, however, is how to define the space of hyperparameter values to choose from. Many methods are available:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;grid search&lt;/strong&gt;: with this method the modeler provides the values to evaluate for each hyperparameter and then picks the combination that gives the best result. However, this method is based entirely on the modeler’s experience with the model in question and the data at hand, so without enough experience, which is often the case, the choice is largely arbitrary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Random search&lt;/strong&gt;: with this method we choose the values for each hyperparameter at random, and it turns out that this method is sometimes more accurate than the previous one. However, it also suffers from the arbitrariness of the candidate values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bayesian hyperparameters&lt;/strong&gt;: this method uses Bayesian optimization to guide the search strategy toward the best hyperparameter values at minimum cost (the cost being the number of models trained). We will briefly discuss this method, but if you want more detail you can check the following great &lt;a href=&#34;https://distill.pub/2020/bayesian-optimization/&#34;&gt;article&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
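To make the contrast between grid search and random search concrete, here is a minimal, self-contained sketch in base R. The objective function, its optimum at (2, 3), and the search ranges are all illustrative assumptions standing in for a cross-validated metric:

```r
# Hypothetical objective: stands in for a cross-validated metric we maximize.
objective <- function(x, y) exp(-(x - 2)^2 - (y - 3)^2)

# Grid search: evaluate a fixed grid of candidate values.
grid <- expand.grid(x = 0:5, y = 0:5)
grid$score <- mapply(objective, grid$x, grid$y)
best_grid <- grid[which.max(grid$score), ]

# Random search: draw the same number of candidates uniformly at random.
set.seed(123)
rand <- data.frame(x = runif(nrow(grid), 0, 5), y = runif(nrow(grid), 0, 5))
rand$score <- mapply(objective, rand$x, rand$y)
best_rand <- rand[which.max(rand$score), ]
```

Both methods evaluate the same number of candidates, but neither uses past evaluations to guide the next one, which is exactly what Bayesian optimization adds.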
&lt;p&gt;We will focus on the best of these, &lt;strong&gt;Bayesian hyperparameters&lt;/strong&gt;, but first we briefly introduce the others.&lt;/p&gt;
&lt;p&gt;To understand these methods well, we will make use of a small dataset with a small number of predictors, and we will use two models: the machine learning model &lt;strong&gt;Random forest&lt;/strong&gt; and the deep learning model &lt;strong&gt;feedforward neural network&lt;/strong&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;bayesian-optimization&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Bayesian optimization&lt;/h1&gt;
&lt;p&gt;The main idea behind this method is very simple: at the first iteration we pick a point at random; then at each subsequent iteration, based on Bayes’ rule, we make a trade-off between choosing the point with the highest uncertainty (known as &lt;strong&gt;active learning&lt;/strong&gt;) and choosing a point within the region that has so far given the best result (optimum objective function). In more detail, say we are dealing with a maximization problem, such as maximizing the accuracy rate for a classification problem. At each iteration, the Bayesian optimization method must decide between focusing the search on the region containing the best point found so far (called &lt;strong&gt;exploitation&lt;/strong&gt;) and inspecting another region with the highest uncertainty (called &lt;strong&gt;exploration&lt;/strong&gt;). The question, however, is how the method decides between these two options. It uses what is called an &lt;strong&gt;acquisition function&lt;/strong&gt;, a function that helps decide which region to inspect in the next iteration.&lt;/p&gt;
&lt;div id=&#34;acquisition-functions&#34; class=&#34;section level2&#34; number=&#34;2.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.1&lt;/span&gt; Acquisition functions&lt;/h2&gt;
&lt;p&gt;Since this problem is always partially random, many forms of this function exist; we will briefly discuss the most common ones.&lt;/p&gt;
&lt;div id=&#34;upper-confidence-bound&#34; class=&#34;section level3&#34; number=&#34;2.1.1&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.1.1&lt;/span&gt; Upper confidence bound&lt;/h3&gt;
&lt;p&gt;Using this function, the next point selected will be the one with the highest upper confidence bound, and, assuming a Gaussian process, this bound is computed as follows:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[UCB(x)=\mu(x)+\kappa\sigma(x)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Where &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\sigma\)&lt;/span&gt; are the mean and the standard deviation defined by the Gaussian process, and &lt;span class=&#34;math inline&#34;&gt;\(\kappa\)&lt;/span&gt; is an exploration parameter: the larger its value, the more exploration.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;probability-of-improvement-pi&#34; class=&#34;section level3&#34; number=&#34;2.1.2&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.1.2&lt;/span&gt; Probability of improvement PI&lt;/h3&gt;
&lt;p&gt;This acquisition function chooses as the next point the one with the highest probability of improvement over the current maximum of the objective function &lt;span class=&#34;math inline&#34;&gt;\(f_{max}\)&lt;/span&gt; obtained from the previously evaluated points.&lt;/p&gt;
&lt;p&gt;Assuming the Gaussian process, the new point then will be the one that has the highest following probability:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[PI(x)=\Phi\left(\frac{\mu(x)-f_{max}-\varepsilon}{\sigma(x)}\right)\]&lt;/span&gt;
Where &lt;span class=&#34;math inline&#34;&gt;\(\varepsilon\)&lt;/span&gt; trades off exploration against exploitation, with larger values resulting in more exploration.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;expected-improvement-ei&#34; class=&#34;section level3&#34; number=&#34;2.1.3&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.1.3&lt;/span&gt; Expected improvement EI&lt;/h3&gt;
&lt;p&gt;This acquisition function, unlike the previous one, tries to quantify how much improvement we get from the new point; the point with the maximum expected improvement is chosen.&lt;/p&gt;
&lt;p&gt;Again, by assuming the Gaussian process, this function can be computed as follows:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[EI(x) = \left(\mu(x)-f_{max}\right)\Phi\left(\frac{\mu(x)-f_{max}-\varepsilon}{\sigma(x)}\right)+\sigma(x)\phi\left(\frac{\mu(x)-f_{max}-\varepsilon}{\sigma(x)}\right)\]&lt;/span&gt;&lt;/p&gt;
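The three acquisition functions above can be sketched in a few lines of base R. This is a minimal illustration assuming the Gaussian-process posterior mean mu and standard deviation sigma at a candidate point are already available; the function names and the default values of kappa and eps are arbitrary choices for demonstration:

```r
# Upper confidence bound: mu + kappa * sigma.
ucb <- function(mu, sigma, kappa = 2) mu + kappa * sigma

# Probability of improvement over the current best f_max.
pi_acq <- function(mu, sigma, f_max, eps = 0.01) {
  pnorm((mu - f_max - eps) / sigma)
}

# Expected improvement, following the formula above.
ei <- function(mu, sigma, f_max, eps = 0.01) {
  z <- (mu - f_max - eps) / sigma
  (mu - f_max) * pnorm(z) + sigma * dnorm(z)
}
```

In practice a package such as rBayesianOptimization evaluates one of these over the whole candidate space and proposes the maximizer as the next point to try.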
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;Let’s first call the packages needed along this article.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;options(warn=-1)
library(readr)
library(tidymodels)
library(themis)
library(plot3D)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For our illustration, we will use data downloaded from the link described in the script below. It is already split into a training set and a testing set. The target variable is a highly imbalanced binary variable, and the data has a very large number of missing values. To simplify our analysis, however, we will reduce the size of this data by first removing the variables with a large number of missing values, then removing the remaining rows that contain missing values, and lastly correcting the imbalanced distribution using downsampling. For more detail about this data, check my previous &lt;a href=&#34;https://modelingwithr.rbind.io/post/scania/predicting-large-and-imbalanced-data-set-using-the-r-package-tidymodels/&#34;&gt;article&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;options(warn=-1)
train &amp;lt;- read_csv(&amp;quot;https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_training_set.csv&amp;quot;, skip = 20)
test &amp;lt;- read_csv(&amp;quot;https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_test_set.csv&amp;quot;, skip = 20)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This data has 171 variables and a total of 76000 instances: 60000 in the training set and 16000 in the testing set. Notice that the missing values in the data are represented by lower case &lt;strong&gt;na&lt;/strong&gt; values rather than &lt;strong&gt;NA&lt;/strong&gt; and hence are not recognized by R as missing values; that is why the &lt;strong&gt;read_csv&lt;/strong&gt; function converted all the variables that contain them to the character type.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map_chr(train, typeof) %&amp;gt;% 
  tibble() %&amp;gt;% 
  table()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;.
character    double 
      170         1 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To fix this problem, we replace na with NA and then convert the corresponding variables back to the numeric type.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train[-1] &amp;lt;- train[-1] %&amp;gt;% 
  modify(~replace(., .==&amp;quot;na&amp;quot;, NA)) %&amp;gt;%
  modify(., as.double)

test[-1] &amp;lt;- test[-1] %&amp;gt;% 
  modify(~replace(., .==&amp;quot;na&amp;quot;, NA)) %&amp;gt;%
  modify(., as.double)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we keep only the predictors that have fewer than 600 missing values, and thereafter we keep only the rows without missing values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;names &amp;lt;- modify(train[-1], is.na) %&amp;gt;% 
  colSums() %&amp;gt;%
  tibble(names = colnames(train[-1]), missing_values=.) %&amp;gt;% 
  filter(missing_values &amp;lt; 600) %&amp;gt;% 
  select(1)
train1 &amp;lt;- train[c(&amp;quot;class&amp;quot;,names$names)] %&amp;gt;% 
  .[complete.cases(.),]
test1 &amp;lt;- test[c(&amp;quot;class&amp;quot;,names$names)] %&amp;gt;% 
  .[complete.cases(.),]
dim(train1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 58888    11&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dim(test1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 15728    11&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, the data has now been reduced to 11 variables. The last thing to check is the distribution of the target variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prop.table(table(train1$class))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
       neg        pos 
0.98376579 0.01623421 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, the data is highly imbalanced, so we correct this problem by downsampling using the &lt;strong&gt;themis&lt;/strong&gt; package. But first, we create a recipe that defines the formula of our model, normalizes the predictors, and downsamples the data. Then we execute the recipe on the data using the &lt;strong&gt;prep&lt;/strong&gt; function, and lastly we retrieve the transformed data with the &lt;strong&gt;juice&lt;/strong&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train2 &amp;lt;- recipe(class~., data=train1) %&amp;gt;%
  step_normalize(all_predictors()) %&amp;gt;% 
  step_downsample(class, seed = 111) %&amp;gt;%
  prep() %&amp;gt;% 
  juice()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the testing set, we transform it with the same recipe, but at the end we make use of the &lt;strong&gt;bake&lt;/strong&gt; function on the testing set instead of &lt;strong&gt;juice&lt;/strong&gt;, which uses the data defined in the recipe.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test2 &amp;lt;- recipe(class~., data=train1) %&amp;gt;%
  step_normalize(all_predictors()) %&amp;gt;%
  themis::step_downsample(class, seed = 111) %&amp;gt;% 
  prep() %&amp;gt;%
  bake(test1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It should be noted here that &lt;strong&gt;step_downsample&lt;/strong&gt; is not needed for the testing set; that is why &lt;strong&gt;bake&lt;/strong&gt; skips this step by default when applied to any new data. We can check this from the dimensions, which remain the same.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dim(test1); dim(test2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 15728    11&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 15728    11&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The testing set is not needed here, since our purpose is to compare hyperparameter tuning methods, which require only the training set. It is included only to show how it is processed within the tidymodels workflow.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;random-forest-model&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Random forest model&lt;/h1&gt;
&lt;p&gt;Since we are dealing with a classification problem, our objective function will be the area under the ROC curve &lt;strong&gt;roc_auc&lt;/strong&gt;. And for the model, we will use the most popular one, the &lt;strong&gt;Random forest&lt;/strong&gt; model, with two hyperparameters to tune:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;mtry&lt;/strong&gt;: The number of predictors sampled at each split.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;min_n&lt;/strong&gt;: The minimum number of instances in the node to split further.&lt;/li&gt;
&lt;/ul&gt;
&lt;div id=&#34;the-true-distribution-of-the-hyperparameters&#34; class=&#34;section level2&#34; number=&#34;4.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.1&lt;/span&gt; The true distribution of the hyperparameters&lt;/h2&gt;
&lt;p&gt;With this small data, this model is not computationally expensive, so we will try a wide range of values for the above hyperparameters and then use the results as the true distribution against which to compare the performance of the hyperparameter tuning methods discussed above.&lt;/p&gt;
&lt;p&gt;Using the &lt;strong&gt;tidymodels&lt;/strong&gt; package we define the model and leave &lt;code&gt;mtry&lt;/code&gt; and &lt;code&gt;min_n&lt;/code&gt; for tuning. To speed up computation, however, we restrict the number of trees to 100 instead of the default of 500.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_tune &amp;lt;- rand_forest(mtry = tune(), min_n = tune(), trees = 100L) %&amp;gt;% 
  set_engine(&amp;quot;ranger&amp;quot;, seed=222) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To avoid the pitfalls of a single random split, we use cross-validation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
folds &amp;lt;- vfold_cv(train2, v=5, strata = class) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we use the following workflow:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tune_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(model_tune) %&amp;gt;% 
  add_formula(class~.)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we are ready to train the model over a wide grid of combinations, which we then treat as the true distribution of our model. Note that we have 100 combinations, so if the model took hours to train, tuning would require days. To prevent the model from rerunning each time this document is rendered, we save the results in a csv file and load it again. If you want to run this model yourself, uncomment the script.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#tuned &amp;lt;- tune_wf %&amp;gt;% 
#   tune_grid(resamples = folds, 
#            grid =expand.grid(mtry=1:10, min_n=1:10),
#            metrics=metric_set(roc_auc))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To extract the results we use the &lt;strong&gt;collect_metrics&lt;/strong&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#df &amp;lt;- tuned %&amp;gt;% collect_metrics()
#write_csv(df, &amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df.csv&amp;quot;)
df &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df.csv&amp;quot;)
df &amp;lt;- df %&amp;gt;% arrange(-mean) %&amp;gt;% tibble(rank=seq_len(nrow(df)), .)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 100 x 8
    rank  mtry min_n .metric .estimator  mean     n std_err
   &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
 1     1     1     2 roc_auc binary     0.984     5 0.00198
 2     2     3     6 roc_auc binary     0.984     5 0.00204
 3     3     3     7 roc_auc binary     0.983     5 0.00269
 4     4     3     8 roc_auc binary     0.983     5 0.00304
 5     5     2    10 roc_auc binary     0.983     5 0.00200
 6     6     3     4 roc_auc binary     0.983     5 0.00246
 7     7     4     4 roc_auc binary     0.983     5 0.00224
 8     8     2     1 roc_auc binary     0.983     5 0.00170
 9     9     3     5 roc_auc binary     0.983     5 0.00271
10    10     4     5 roc_auc binary     0.983     5 0.00246
# ... with 90 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We reached the maximum value for the objective function &lt;strong&gt;0.9837565&lt;/strong&gt; with the following hyperparameter values &lt;code&gt;mtry = 1&lt;/code&gt;, and &lt;code&gt;min_n = 2&lt;/code&gt;.&lt;/p&gt;
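&lt;p&gt;If the tuning object were still in memory, the best combination could also be extracted directly with the &lt;strong&gt;select_best&lt;/strong&gt; function from the tune package instead of sorting the collected metrics by hand. This is only a sketch, kept commented like the chunks above since &lt;code&gt;tuned&lt;/code&gt; is not rerun when rendering:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# sketch: assumes the tuned object returned by tune_grid() is available
#best_combo &amp;lt;- tuned %&amp;gt;% select_best(&amp;quot;roc_auc&amp;quot;)
#best_combo   # one row with the mtry and min_n of the top roc_auc&lt;/code&gt;&lt;/pre&gt;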
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: we will ignore the overfitting problem, since we are only making comparisons on the same training set.&lt;/p&gt;
&lt;p&gt;We can plot this distribution as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;scatter3D(x=df$mtry, y=df$min_n, z=df$mean, phi = 0, bty = &amp;quot;g&amp;quot;,  type = &amp;quot;h&amp;quot;, 
          ticktype = &amp;quot;detailed&amp;quot;, pch = 19, cex = 0.5, 
          main=&amp;quot;the true distribution&amp;quot;, xlab=&amp;quot;mtry&amp;quot;,
          ylab=&amp;quot;min_n&amp;quot;, zlab=&amp;quot;roc_auc&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/bayes/hyper_bayes/2020-05-13-bayesian-hyperparameters-method_files/figure-html/unnamed-chunk-17-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;random-search&#34; class=&#34;section level2&#34; number=&#34;4.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.2&lt;/span&gt; Random search&lt;/h2&gt;
&lt;p&gt;Now suppose that we are allowed to try only 10 combinations. For the random search strategy, we use the &lt;strong&gt;grid_random&lt;/strong&gt; function from the &lt;strong&gt;dials&lt;/strong&gt; package (bundled with the tidymodels package).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#tuned_random &amp;lt;- tune_wf %&amp;gt;% 
#  tune_grid(resamples = folds, 
#            grid = grid_random(mtry(range = c(1,10)), min_n(range = c(1,10)),
#                               size = 10),
#            metrics=metric_set(roc_auc))
#df_r &amp;lt;- tuned_random %&amp;gt;% collect_metrics()

#write_csv(df_r, &amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_r.csv&amp;quot;)
df_r &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_r.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_r %&amp;gt;% arrange(-mean) %&amp;gt;% 
  head(1) %&amp;gt;% 
  inner_join(df, by=&amp;quot;mean&amp;quot;) %&amp;gt;%
  select(rank, mtry.x, min_n.x, mean)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 4
   rank mtry.x min_n.x  mean
  &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
1    13      3      10 0.983&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The maximum value obtained by this method is 0.9826900, which corresponds to the 13th rank of the true distribution, and the associated hyperparameter values are &lt;code&gt;mtry=3&lt;/code&gt; and &lt;code&gt;min_n=10&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;bayesian-optimization-ucb&#34; class=&#34;section level2&#34; number=&#34;4.3&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.3&lt;/span&gt; Bayesian optimization UCB&lt;/h2&gt;
&lt;p&gt;For the Bayesian optimization method, we make use of the &lt;strong&gt;tune_bayes&lt;/strong&gt; function from the &lt;strong&gt;tune&lt;/strong&gt; package, which provides all the acquisition functions discussed above. We start with the UCB function and set the argument &lt;strong&gt;kappa&lt;/strong&gt; equal to 2; this argument controls the trade-off between exploitation and exploration, with larger values leading to more exploration. Notice that, by default, this function explores 10 combinations; to try a different number, change the argument &lt;strong&gt;iter&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#tuned_UCB &amp;lt;- tune_wf %&amp;gt;% 
#  tune_bayes(resamples = folds,
#          param_info=parameters(mtry(range = c(1,10)), min_n(range = c(1,10))),
#             metrics=metric_set(roc_auc),
#             objective=conf_bound(kappa = 2))

#df_UCB &amp;lt;- tuned_UCB %&amp;gt;% collect_metrics()
#write_csv(df_UCB,&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_UCB.csv&amp;quot;)
df_UCB &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_UCB.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_UCB %&amp;gt;% arrange(-mean) %&amp;gt;% 
  head(1) %&amp;gt;% 
  inner_join(df, by=&amp;quot;mean&amp;quot;) %&amp;gt;%
  select(rank, mtry.x, min_n.x, mean)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 4
   rank mtry.x min_n.x  mean
  &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
1     2      3       6 0.984&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this acquisition function, we obtained a better result than random search, 0.9837529, which occupies the second position.&lt;/p&gt;
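&lt;p&gt;As mentioned above, the argument &lt;strong&gt;iter&lt;/strong&gt; controls how many combinations &lt;strong&gt;tune_bayes&lt;/strong&gt; explores. As a sketch of how the call would change to try, say, 20 combinations instead of the default 10 (kept commented like the chunks above, since rerunning is expensive):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# sketch: same UCB search as above but with 20 Bayesian iterations
#tuned_UCB20 &amp;lt;- tune_wf %&amp;gt;% 
#  tune_bayes(resamples = folds, iter = 20,
#             param_info = parameters(mtry(range = c(1,10)), min_n(range = c(1,10))),
#             metrics = metric_set(roc_auc),
#             objective = conf_bound(kappa = 2))&lt;/code&gt;&lt;/pre&gt;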
&lt;/div&gt;
&lt;div id=&#34;bayesian-optimization-pi&#34; class=&#34;section level2&#34; number=&#34;4.4&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.4&lt;/span&gt; Bayesian optimization PI&lt;/h2&gt;
&lt;p&gt;This time we explore the PI acquisition function discussed above. Note that larger values of the &lt;code&gt;trade_off&lt;/code&gt; argument lead to more exploration than exploitation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#tuned_PI &amp;lt;- tune_wf %&amp;gt;% 
#  tune_bayes(resamples = folds,
#         param_info=parameters(mtry(range = c(1,10)), min_n(range = c(1,10))),
#             metrics=metric_set(roc_auc),
#             objective=prob_improve(trade_off = 0.01))

#df_PI &amp;lt;- tuned_PI %&amp;gt;% collect_metrics()
#write_csv(df_PI, &amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_PI.csv&amp;quot;)
df_PI &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_PI.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_PI %&amp;gt;% arrange(-mean) %&amp;gt;% 
  head(1) %&amp;gt;% 
  inner_join(df, by=&amp;quot;mean&amp;quot;) %&amp;gt;%
  select(rank, mtry.x, min_n.x, mean)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 4
   rank mtry.x min_n.x  mean
  &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
1    11      2       9 0.983&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, with this acquisition function we obtained the 11th position, which is worse than the previous one but still better than random search.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;bayesian-optimization-ei&#34; class=&#34;section level2&#34; number=&#34;4.5&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.5&lt;/span&gt; Bayesian optimization EI&lt;/h2&gt;
&lt;p&gt;Now we try another acquisition function, the expected improvement function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#tuned_EI &amp;lt;- tune_wf %&amp;gt;% 
#  tune_bayes(resamples = folds,
#             param_info=parameters(mtry(range = c(1,10)), 
#             min_n(range = c(1,10))),
#             metrics=metric_set(roc_auc),
#             objective=exp_improve(trade_off = 0.01))

#df_EI &amp;lt;- tuned_EI %&amp;gt;% collect_metrics()

#write_csv(df_EI, &amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_EI.csv&amp;quot;)
df_EI &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_EI.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_EI %&amp;gt;% arrange(-mean) %&amp;gt;% 
  head(1) %&amp;gt;% 
  inner_join(df, by=&amp;quot;mean&amp;quot;) %&amp;gt;%
  select(rank, mtry.x, min_n.x, mean)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 4
   rank mtry.x min_n.x  mean
  &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
1     2      3       6 0.984&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here we get the same result as with the UCB method.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;contrast-the-results&#34; class=&#34;section level2&#34; number=&#34;4.6&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.6&lt;/span&gt; Contrast the results&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_rf &amp;lt;- tibble(names=c(&amp;quot;true&amp;quot;, &amp;quot;random&amp;quot;, &amp;quot;UCB&amp;quot;, &amp;quot;PI&amp;quot;, &amp;quot;EI&amp;quot;),
                mtry=c(df[df$mean==max(df$mean),1, drop=TRUE],
                  df_r[df_r$mean==max(df_r$mean),1, drop=TRUE],
                     df_UCB[df_UCB$mean==max(df_UCB$mean),1, drop=TRUE],
                     df_PI[df_PI$mean==max(df_PI$mean),1, drop=TRUE],
                     df_EI[df_EI$mean==max(df_EI$mean),1, drop=TRUE]),
             min_n=c(df[df$mean==max(df$mean),2, drop=TRUE],
               df_r[df_r$mean==max(df_r$mean),2, drop=TRUE],
                     df_UCB[df_UCB$mean==max(df_UCB$mean),2, drop=TRUE],
                     df_PI[df_PI$mean==max(df_PI$mean),2, drop=TRUE],
                     df_EI[df_EI$mean==max(df_EI$mean),2, drop=TRUE]),
             roc_auc=c(max(df$mean), max(df_r$mean),max(df_UCB$mean),
                       max(df_PI$mean),max(df_EI$mean)),
             std_err=c(df[df$mean==max(df$mean),ncol(df), drop=TRUE],
                    df_r[df_r$mean==max(df_r$mean),ncol(df_r), drop=TRUE],
                     df_UCB[df_UCB$mean==max(df_UCB$mean),ncol(df_UCB), drop=TRUE],
                     df_PI[df_PI$mean==max(df_PI$mean),ncol(df_PI), drop=TRUE],
                     df_EI[df_EI$mean==max(df_EI$mean),ncol(df_EI), drop=TRUE]))
df_rf %&amp;gt;% arrange(-roc_auc) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 5 x 5
  names   mtry min_n roc_auc std_err
  &amp;lt;chr&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
1 true       1     1   0.984 0.00198
2 UCB        3     6   0.984 0.00204
3 EI         3     6   0.984 0.00204
4 PI         2     9   0.983 0.00290
5 random     3    10   0.983 0.00275&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, the Bayesian optimization method performs better than random search whichever acquisition function is used. However, since the acquisition functions have their own hyperparameters (to trade off exploration and exploitation), their performance may differ strongly from one another. Moreover, the difference can be larger on a bigger dataset.&lt;/p&gt;
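&lt;p&gt;As an aside, the summary table above can be built more compactly with dplyr and purrr (both attached via tidymodels). The following sketch, assuming the five tibbles &lt;code&gt;df&lt;/code&gt;, &lt;code&gt;df_r&lt;/code&gt;, &lt;code&gt;df_UCB&lt;/code&gt;, &lt;code&gt;df_PI&lt;/code&gt;, and &lt;code&gt;df_EI&lt;/code&gt; are in memory, keeps the row with the highest mean from each one:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# sketch: one best row per method, bound into a single tibble
df_best &amp;lt;- list(true = df, random = df_r, UCB = df_UCB, PI = df_PI, EI = df_EI) %&amp;gt;% 
  map_dfr(~ slice_max(.x, mean, n = 1, with_ties = FALSE), .id = &amp;quot;names&amp;quot;) %&amp;gt;% 
  select(names, mtry, min_n, roc_auc = mean, std_err) %&amp;gt;% 
  arrange(-roc_auc)&lt;/code&gt;&lt;/pre&gt;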
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;deep-learning-model&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Deep learning model&lt;/h1&gt;
&lt;p&gt;As we did with the random forest model, we explore a wide range of hyperparameter values, assume these values represent the true distribution, and then apply all the tuning methods above. For the architecture, we use a deep learning model with a single hidden layer, and we tune two hyperparameters: the number of nodes and the number of epochs. Finally, we save what we need in a csv file and load it again, since each rerun of the model gives a different result and, besides, it takes a lot of time.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_tune1 &amp;lt;- mlp(hidden_units = tune(), epochs = tune()) %&amp;gt;% 
  set_engine(&amp;quot;keras&amp;quot;) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;)

tune_wf1 &amp;lt;- workflow() %&amp;gt;% 
  add_model(model_tune1) %&amp;gt;% 
  add_formula(class~.)

#tuned_deepl &amp;lt;- tune_wf1 %&amp;gt;% 
#  tune_grid(resamples = folds, 
#            grid = expand.grid(hidden_units=5:15, epochs=10:20),
#            metrics=metric_set(roc_auc))

#df1 &amp;lt;- tuned_deepl %&amp;gt;% collect_metrics()


#write_csv(df1, &amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df1.csv&amp;quot;)

df1 &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df1.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df1 %&amp;gt;% arrange(-mean)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 121 x 7
   hidden_units epochs .metric .estimator  mean     n std_err
          &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
 1           12     18 roc_auc binary     0.982     5 0.00255
 2           11     20 roc_auc binary     0.982     5 0.00265
 3            9     14 roc_auc binary     0.982     5 0.00274
 4            8     16 roc_auc binary     0.982     5 0.00207
 5           14     18 roc_auc binary     0.982     5 0.00241
 6           14     20 roc_auc binary     0.982     5 0.00244
 7           12     17 roc_auc binary     0.982     5 0.00265
 8           10     16 roc_auc binary     0.982     5 0.00230
 9           10     19 roc_auc binary     0.981     5 0.00231
10           12     20 roc_auc binary     0.981     5 0.00227
# ... with 111 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With 12 units in the hidden layer and 18 epochs, we reached the maximum value of 0.9819806 for the area under the ROC curve.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;scatter3D(x=df1$hidden_units, y=df1$epochs, z=df1$mean, 
          phi = 0, bty = &amp;quot;g&amp;quot;,  type = &amp;quot;h&amp;quot;, 
          ticktype = &amp;quot;detailed&amp;quot;, pch = 19, cex = 0.5, 
          main=&amp;quot;the true distribution&amp;quot;, xlab=&amp;quot;units&amp;quot;,
          ylab=&amp;quot;epochs&amp;quot;, zlab=&amp;quot;roc_auc&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/bayes/hyper_bayes/2020-05-13-bayesian-hyperparameters-method_files/figure-html/unnamed-chunk-29-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;random-search-1&#34; class=&#34;section level2&#34; number=&#34;5.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;5.1&lt;/span&gt; Random search&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#tuned_random1 &amp;lt;- tune_wf1 %&amp;gt;% 
#  tune_grid(resamples = folds, 
#            grid = grid_random(hidden_units(range = c(5,15)),
#                               epochs(range = c(10,20)), size = 10),
#            metrics=metric_set(roc_auc))
#df_r1 &amp;lt;- tuned_random1 %&amp;gt;% collect_metrics()

#write_csv(df_r1, &amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_r1.csv&amp;quot;)
df_r1 &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_r1.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_r1 %&amp;gt;% arrange(-mean) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 10 x 7
   hidden_units epochs .metric .estimator  mean     n std_err
          &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
 1           14     11 roc_auc binary     0.982     5 0.00230
 2           14     17 roc_auc binary     0.981     5 0.00240
 3            8     15 roc_auc binary     0.981     5 0.00266
 4           11     19 roc_auc binary     0.981     5 0.00268
 5           14     16 roc_auc binary     0.981     5 0.00221
 6           10     12 roc_auc binary     0.980     5 0.00259
 7            7     13 roc_auc binary     0.980     5 0.00229
 8            5     16 roc_auc binary     0.980     5 0.00241
 9            9     10 roc_auc binary     0.980     5 0.00244
10            5     13 roc_auc binary     0.979     5 0.00270&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Deep learning models strongly depend on an internal random process (such as the weight initialization), and the differences between results are small because of the small size of the data, so it is harder to contrast the methods. To alleviate this problem, we will this time use the standard errors to check whether the differences between methods are significant.&lt;/p&gt;
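&lt;p&gt;One rough way to see this is to attach an approximate 95% interval (about two standard errors on either side of the mean) to each candidate. The sketch below assumes the &lt;code&gt;df_r1&lt;/code&gt; tibble loaded above is in memory:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# sketch: approximate 95% intervals around each cross-validated roc_auc
df_r1 %&amp;gt;% 
  arrange(-mean) %&amp;gt;% 
  mutate(lower = mean - 2 * std_err,
         upper = mean + 2 * std_err) %&amp;gt;% 
  select(hidden_units, epochs, mean, lower, upper)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When these intervals overlap heavily, the apparent ranking of the configurations should not be over-interpreted.&lt;/p&gt;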
&lt;/div&gt;
&lt;div id=&#34;bayesian-optimization-ucb-1&#34; class=&#34;section level2&#34; number=&#34;5.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;5.2&lt;/span&gt; Bayesian optimization UCB&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#tuned_UCB1 &amp;lt;- tune_wf1 %&amp;gt;% 
#  tune_bayes(resamples = folds,
#             param_info=parameters(hidden_units(range = c(5L,15L)), 
#             epochs(range = c(10L,20L))),
#             metrics=metric_set(roc_auc),
#             objective=conf_bound(kappa = 2))

#df_UCB1 &amp;lt;- tuned_UCB1 %&amp;gt;% collect_metrics()

#write_csv(df_UCB1, &amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_UCB1.csv&amp;quot;)
df_UCB1 &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_UCB1.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_UCB1 %&amp;gt;% arrange(-mean)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 15 x 8
   hidden_units epochs .iter .metric .estimator  mean     n std_err
          &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
 1           12     19     8 roc_auc binary     0.981     5 0.00237
 2           12     17     0 roc_auc binary     0.981     5 0.00250
 3            9     19    10 roc_auc binary     0.981     5 0.00203
 4           12     18     2 roc_auc binary     0.981     5 0.00236
 5           11     17     3 roc_auc binary     0.981     5 0.00248
 6           13     19     9 roc_auc binary     0.981     5 0.00246
 7           13     18     6 roc_auc binary     0.981     5 0.00260
 8           13     17     1 roc_auc binary     0.981     5 0.00254
 9           10     19     0 roc_auc binary     0.981     5 0.00230
10           11     19     7 roc_auc binary     0.981     5 0.00221
11           13     15     0 roc_auc binary     0.981     5 0.00257
12           11     18     5 roc_auc binary     0.981     5 0.00269
13           12     16     4 roc_auc binary     0.981     5 0.00213
14            5     11     0 roc_auc binary     0.979     5 0.00233
15            9     12     0 roc_auc binary     0.979     5 0.00288&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;bayesian-optimization-pi-1&#34; class=&#34;section level2&#34; number=&#34;5.3&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;5.3&lt;/span&gt; Bayesian optimization PI&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#tuned_PI1 &amp;lt;- tune_wf1 %&amp;gt;% 
#  tune_bayes(resamples = folds,
#             param_info=parameters(hidden_units(range = c(5L,15L)), 
#             epochs(range = c(10L,20L))),
#             metrics=metric_set(roc_auc),
#             objective=prob_improve(trade_off = 0.01))

#df_PI1 &amp;lt;- tuned_PI1 %&amp;gt;% collect_metrics()

#write_csv(df_PI1, &amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_PI1.csv&amp;quot;)
df_PI1 &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_PI1.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_PI1 %&amp;gt;% arrange(-mean)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 15 x 8
   hidden_units epochs .iter .metric .estimator  mean     n std_err
          &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
 1            7     20     6 roc_auc binary     0.981     5 0.00236
 2           10     10     5 roc_auc binary     0.981     5 0.00237
 3            9     17     0 roc_auc binary     0.981     5 0.00216
 4           15     16     8 roc_auc binary     0.981     5 0.00213
 5           14     20     0 roc_auc binary     0.981     5 0.00251
 6           11     20     4 roc_auc binary     0.981     5 0.00282
 7           15     10     3 roc_auc binary     0.981     5 0.00272
 8            6     10     0 roc_auc binary     0.981     5 0.00286
 9           13     20    10 roc_auc binary     0.981     5 0.00253
10           12     14     0 roc_auc binary     0.981     5 0.00268
11            8     12     0 roc_auc binary     0.980     5 0.00220
12           13     10     7 roc_auc binary     0.980     5 0.00273
13            8     10     1 roc_auc binary     0.980     5 0.00305
14            5     20     2 roc_auc binary     0.980     5 0.00227
15            5     12     9 roc_auc binary     0.979     5 0.00234&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;bayesian-optimization-ei-1&#34; class=&#34;section level2&#34; number=&#34;5.4&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;5.4&lt;/span&gt; Bayesian optimization EI&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#tuned_EI1 &amp;lt;- tune_wf1 %&amp;gt;% 
#  tune_bayes(resamples = folds,
#             param_info=parameters(hidden_units(range = c(5L,15L)), 
#             epochs(range = c(10L,20L))),
#             metrics=metric_set(roc_auc),
#             objective=exp_improve(trade_off = 0.01))

#df_EI1 &amp;lt;- tuned_EI1 %&amp;gt;% collect_metrics()

#write_csv(df_EI1, &amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_EI1.csv&amp;quot;)
df_EI1 &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_EI1.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_EI1 %&amp;gt;% arrange(-mean)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 15 x 8
   hidden_units epochs .iter .metric .estimator  mean     n std_err
          &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
 1           10     20     0 roc_auc binary     0.981     5 0.00229
 2           15     18     1 roc_auc binary     0.981     5 0.00263
 3            7     16     0 roc_auc binary     0.981     5 0.00286
 4           12     20     8 roc_auc binary     0.981     5 0.00276
 5            8     19     6 roc_auc binary     0.981     5 0.00231
 6           14     14     0 roc_auc binary     0.981     5 0.00229
 7           13     14     7 roc_auc binary     0.981     5 0.00253
 8            5     20    10 roc_auc binary     0.981     5 0.00205
 9            6     13     5 roc_auc binary     0.981     5 0.00244
10           12     10     0 roc_auc binary     0.980     5 0.00266
11           11     14     4 roc_auc binary     0.980     5 0.00191
12           10     10     9 roc_auc binary     0.980     5 0.00284
13            5     12     0 roc_auc binary     0.980     5 0.00274
14            5     10     2 roc_auc binary     0.980     5 0.00261
15            9     14     3 roc_auc binary     0.979     5 0.00243&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;contrast-the-results-1&#34; class=&#34;section level2&#34; number=&#34;5.5&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;5.5&lt;/span&gt; Contrast the results&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_deep &amp;lt;- tibble(names=c(&amp;quot;true&amp;quot;, &amp;quot;random&amp;quot;, &amp;quot;UCB&amp;quot;, &amp;quot;PI&amp;quot;, &amp;quot;EI&amp;quot;),
                  hidden_units=c(df1[df1$mean==max(df1$mean),1, drop=TRUE],
                    df_r1[df_r1$mean==max(df_r1$mean),1, drop=TRUE],
                     df_UCB1[df_UCB1$mean==max(df_UCB1$mean),1, drop=TRUE],
                     df_PI1[df_PI1$mean==max(df_PI1$mean),1, drop=TRUE],
                     df_EI1[df_EI1$mean==max(df_EI1$mean),1, drop=TRUE]),
                  epochs=c(df1[df1$mean==max(df1$mean),2, drop=TRUE],
                    df_r1[df_r1$mean==max(df_r1$mean),2, drop=TRUE],
                     df_UCB1[df_UCB1$mean==max(df_UCB1$mean),2, drop=TRUE],
                     df_PI1[df_PI1$mean==max(df_PI1$mean),2, drop=TRUE],
                     df_EI1[df_EI1$mean==max(df_EI1$mean),2, drop=TRUE]),
                  roc_auc=c(max(df1$mean), max(df_r1$mean), max(df_UCB1$mean),
                       max(df_PI1$mean),max(df_EI1$mean)),
                  std_err=c(df1[df1$mean==max(df1$mean),ncol(df1), drop=TRUE],
                    df_r1[df_r1$mean==max(df_r1$mean),ncol(df_r1), drop=TRUE],
                     df_UCB1[df_UCB1$mean==max(df_UCB1$mean),ncol(df_UCB1), drop=TRUE],
                     df_PI1[df_PI1$mean==max(df_PI1$mean),ncol(df_PI1), drop=TRUE],
                     df_EI1[df_EI1$mean==max(df_EI1$mean),ncol(df_EI1), drop=TRUE]))
             
df_deep &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 5 x 5
  names  hidden_units epochs roc_auc std_err
  &amp;lt;chr&amp;gt;         &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
1 true             12     18   0.982 0.00255
2 random           14     11   0.982 0.00230
3 UCB              12     19   0.981 0.00237
4 PI                7     20   0.981 0.00236
5 EI               10     20   0.981 0.00229&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(df_deep[-1, ]%&amp;gt;% arrange(-roc_auc), aes(x=names, y=roc_auc))+
         geom_point(size=2, color= &amp;quot;red&amp;quot;)+
         geom_hline(yintercept = df_deep[df_deep$names==&amp;quot;true&amp;quot;, 4, drop=TRUE], 
                    color= &amp;quot;blue&amp;quot;, lty=&amp;quot;dashed&amp;quot;)+
         geom_errorbar(aes(ymin=roc_auc-1.92*std_err, ymax=roc_auc+1.92*std_err))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/bayes/hyper_bayes/2020-05-13-bayesian-hyperparameters-method_files/figure-html/unnamed-chunk-39-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As we see, we do not obtain any significant difference between these methods, which is essentially due to the small size of the data and the narrow ranges used for the hyperparameter values. With a larger and more complex dataset, however, the difference should be more apparent.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Conclusion&lt;/h1&gt;
&lt;p&gt;The main purpose of this article is to show how to implement these methods in practice rather than to highlight the performance differences between them. Bayesian optimization is known as the most efficient method, but it has the downside of requiring its own hyperparameters, such as the choice of acquisition function, which are sometimes set arbitrarily. In other words, a limited budget that should be concentrated on searching for the optimum cannot be wasted on trying different acquisition functions.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;session-info&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Session info&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] plot3D_1.3       themis_0.1.2     yardstick_0.0.7  workflows_0.2.0 
 [5] tune_0.1.1       tidyr_1.1.2      tibble_3.0.3     rsample_0.0.8   
 [9] recipes_0.1.13   purrr_0.3.4      parsnip_0.1.3    modeldata_0.0.2 
[13] infer_0.5.3      ggplot2_3.3.2    dplyr_1.0.2      dials_0.0.9     
[17] scales_1.1.1     broom_0.7.1      tidymodels_0.1.1 readr_1.3.1     

loaded via a namespace (and not attached):
 [1] lubridate_1.7.9    doParallel_1.0.15  DiceDesign_1.8-1   tools_4.0.1       
 [5] backports_1.1.10   utf8_1.1.4         R6_2.4.1           rpart_4.1-15      
 [9] colorspace_1.4-1   nnet_7.3-14        withr_2.3.0        tidyselect_1.1.0  
[13] compiler_4.0.1     parallelMap_1.5.0  cli_2.0.2          labeling_0.3      
[17] bookdown_0.20      checkmate_2.0.0    stringr_1.4.0      digest_0.6.25     
[21] rmarkdown_2.4      unbalanced_2.0     pkgconfig_2.0.3    htmltools_0.5.0   
[25] lhs_1.1.0          rlang_0.4.7        rstudioapi_0.11    BBmisc_1.11       
[29] FNN_1.1.3          farver_2.0.3       generics_0.0.2     magrittr_1.5      
[33] ROSE_0.0-3         Matrix_1.2-18      Rcpp_1.0.5         munsell_0.5.0     
[37] fansi_0.4.1        GPfit_1.0-8        lifecycle_0.2.0    furrr_0.1.0       
[41] stringi_1.5.3      pROC_1.16.2        yaml_2.2.1         MASS_7.3-53       
[45] plyr_1.8.6         misc3d_0.9-0       grid_4.0.1         parallel_4.0.1    
[49] listenv_0.8.0      crayon_1.3.4       lattice_0.20-41    splines_4.0.1     
[53] hms_0.5.3          knitr_1.30         mlr_2.17.1         pillar_1.4.6      
[57] tcltk_4.0.1        codetools_0.2-16   fastmatch_1.1-0    glue_1.4.2        
[61] evaluate_0.14      ParamHelpers_1.14  blogdown_0.20      data.table_1.13.0 
[65] vctrs_0.3.4        foreach_1.5.0      gtable_0.3.0       RANN_2.6.1        
[69] future_1.19.1      assertthat_0.2.1   xfun_0.18          gower_0.2.2       
[73] prodlim_2019.11.13 class_7.3-17       survival_3.2-7     timeDate_3043.102 
[77] iterators_1.0.12   lava_1.6.8         globals_0.13.0     ellipsis_0.3.1    
[81] ipred_0.9-9       &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Deep learning model for Titanic data</title>
      <link>https://modelingwithr.rbind.io/courses/deep-learning-model-for-titanic-data/</link>
      <pubDate>Wed, 13 May 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/courses/deep-learning-model-for-titanic-data/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#partition-the-data-impute-the-missing-values.&#34;&gt;Partition the data &amp;amp; impute the missing values.&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#convert-the-data-into-a-numeric-matrix.&#34;&gt;Convert the data into a numeric matrix.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#train-the-model.&#34;&gt;Train the model.&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#create-the-model&#34;&gt;Create the model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#compile-the-model&#34;&gt;Compile the model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#fit-the-model&#34;&gt;Fit the model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-model-evaluation&#34;&gt;The model evaluation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-tuning&#34;&gt;Model tuning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;div id=&#34;introduction&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Deep learning models belong to the family of machine learning models that can be used for either supervised or unsupervised learning. Based on the &lt;a href=&#34;https://www.digitaltrends.com/cool-tech/what-is-an-artificial-neural-network/&#34;&gt;artificial neural network&lt;/a&gt;, they can handle a wide variety of data types by using different neural network architectures, such as the &lt;a href=&#34;https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks&#34;&gt;recurrent neural network RNN&lt;/a&gt; for sequence data (time series, text data, etc.), the &lt;a href=&#34;https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53&#34;&gt;convolutional neural network CNN&lt;/a&gt; for computer vision, the &lt;a href=&#34;https://machinelearningmastery.com/what-are-generative-adversarial-networks-gans/&#34;&gt;generative adversarial network GAN&lt;/a&gt; for image generation, and many other types of architecture.
The basic architecture of a deep learning model is the same as that of the classical artificial neural network (which has one hidden layer), with the difference that deep learning allows more than one hidden layer (this is where the name deep comes from). These layers are called dense layers since each node of a particular layer is connected to all the nodes of the previous layer, and in addition each node has an &lt;a href=&#34;https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/&#34;&gt;activation function&lt;/a&gt; to capture any nonlinearity in the data.&lt;/p&gt;
&lt;p&gt;In this article, we will use a basic deep learning model to predict survival on the famous Titanic data set (from the Kaggle competition).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data preparation&lt;/h2&gt;
&lt;p&gt;We use the Titanic data because it is familiar to almost everyone, which lets us focus on understanding and implementing the model. So let’s load the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh &amp;lt;- suppressPackageStartupMessages
ssh(library(tidyverse))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;ggplot2&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;tibble&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;tidyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;dplyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data &amp;lt;- read_csv(&amp;quot;C://Users/dell/Documents/new-blog/content/post/train.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Parsed with column specification:
## cols(
##   PassengerId = col_double(),
##   Survived = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Ticket = col_character(),
##   Fare = col_double(),
##   Cabin = col_character(),
##   Embarked = col_character()
## )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we load the &lt;strong&gt;keras&lt;/strong&gt; package for deep learning models, and &lt;strong&gt;caret&lt;/strong&gt; for randomly splitting the data and creating the confusion matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh(library(keras))
ssh(library(caret))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first step in modeling is to clean and prepare the data. The following code shows the structure of the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;glimpse(data)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 891
## Columns: 12
## $ PassengerId &amp;lt;dbl&amp;gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Survived    &amp;lt;dbl&amp;gt; 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
## $ Pclass      &amp;lt;dbl&amp;gt; 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
## $ Name        &amp;lt;chr&amp;gt; &amp;quot;Braund, Mr. Owen Harris&amp;quot;, &amp;quot;Cumings, Mrs. John Bradley ...
## $ Sex         &amp;lt;chr&amp;gt; &amp;quot;male&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;male&amp;quot;, &amp;quot;male&amp;quot;, &amp;quot;...
## $ Age         &amp;lt;dbl&amp;gt; 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
## $ SibSp       &amp;lt;dbl&amp;gt; 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
## $ Parch       &amp;lt;dbl&amp;gt; 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
## $ Ticket      &amp;lt;chr&amp;gt; &amp;quot;A/5 21171&amp;quot;, &amp;quot;PC 17599&amp;quot;, &amp;quot;STON/O2. 3101282&amp;quot;, &amp;quot;113803&amp;quot;, ...
## $ Fare        &amp;lt;dbl&amp;gt; 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
## $ Cabin       &amp;lt;chr&amp;gt; NA, &amp;quot;C85&amp;quot;, NA, &amp;quot;C123&amp;quot;, NA, NA, &amp;quot;E46&amp;quot;, NA, NA, NA, &amp;quot;G6&amp;quot;,...
## $ Embarked    &amp;lt;chr&amp;gt; &amp;quot;S&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;Q&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;S&amp;quot;, ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using this data we want to predict the variable &lt;strong&gt;Survived&lt;/strong&gt; using the remaining variables as predictors. We see that some variables have a unique value for every row, such as &lt;strong&gt;PassengerId&lt;/strong&gt;, &lt;strong&gt;Name&lt;/strong&gt;, and &lt;strong&gt;Ticket&lt;/strong&gt;, so they cannot be used as predictors. The same note applies to the variable &lt;strong&gt;Cabin&lt;/strong&gt;, with the additional problem of missing values. These variables will be removed as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-data[,-c(1,4,9,11)]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Some variables should be of factor type, such as &lt;strong&gt;Pclass&lt;/strong&gt; (which is now double), &lt;strong&gt;Sex&lt;/strong&gt; (character), and &lt;strong&gt;Embarked&lt;/strong&gt; (character). Thus, we convert them to factor type:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata &amp;lt;- mydata %&amp;gt;%  modify_at(c(&amp;#39;Pclass&amp;#39;, &amp;#39;Embarked&amp;#39;, &amp;#39;Sex&amp;#39; ), as.factor)
glimpse(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 891
## Columns: 8
## $ Survived &amp;lt;dbl&amp;gt; 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1...
## $ Pclass   &amp;lt;fct&amp;gt; 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3...
## $ Sex      &amp;lt;fct&amp;gt; male, female, female, female, male, male, male, male, fema...
## $ Age      &amp;lt;dbl&amp;gt; 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ...
## $ SibSp    &amp;lt;dbl&amp;gt; 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0...
## $ Parch    &amp;lt;dbl&amp;gt; 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0...
## $ Fare     &amp;lt;dbl&amp;gt; 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,...
## $ Embarked &amp;lt;fct&amp;gt; S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s get a summary of this data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     Survived      Pclass      Sex           Age            SibSp      
##  Min.   :0.0000   1:216   female:314   Min.   : 0.42   Min.   :0.000  
##  1st Qu.:0.0000   2:184   male  :577   1st Qu.:20.12   1st Qu.:0.000  
##  Median :0.0000   3:491                Median :28.00   Median :0.000  
##  Mean   :0.3838                        Mean   :29.70   Mean   :0.523  
##  3rd Qu.:1.0000                        3rd Qu.:38.00   3rd Qu.:1.000  
##  Max.   :1.0000                        Max.   :80.00   Max.   :8.000  
##                                        NA&amp;#39;s   :177                    
##      Parch             Fare        Embarked  
##  Min.   :0.0000   Min.   :  0.00   C   :168  
##  1st Qu.:0.0000   1st Qu.:  7.91   Q   : 77  
##  Median :0.0000   Median : 14.45   S   :644  
##  Mean   :0.3816   Mean   : 32.20   NA&amp;#39;s:  2  
##  3rd Qu.:0.0000   3rd Qu.: 31.00             
##  Max.   :6.0000   Max.   :512.33             
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Two variables have missing values: &lt;strong&gt;Age&lt;/strong&gt; with a large number of them (177), followed by &lt;strong&gt;Embarked&lt;/strong&gt; with only 2.
To deal with this issue we have two options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The first and easier option is to remove every row that has any missing value, at the cost of possibly losing valuable information, especially when the number of missing values is large compared to the total number of observations, as in our case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The second option is to impute the missing values from the complete cases; for instance, we can replace a missing value in a numeric column by the mean of that column, or use a multinomial model to predict missing categorical values.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
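&lt;p&gt;As a minimal sketch of the second option, a numeric column could be mean-imputed by hand. This is only an illustration of the idea; below we use the &lt;strong&gt;mice&lt;/strong&gt; package instead:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# illustration only: replace the missing Age values by the mean of the observed ones
mean_age &amp;lt;- mean(mydata$Age, na.rm = TRUE)
age_imputed &amp;lt;- ifelse(is.na(mydata$Age), mean_age, mydata$Age)&lt;/code&gt;&lt;/pre&gt;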
&lt;p&gt;Fortunately, there is a useful package called &lt;strong&gt;mice&lt;/strong&gt; which will do this imputation for us. However, applying the imputation to the entire data set would lead to a problem called &lt;strong&gt;train-test contamination&lt;/strong&gt;: when we split the data, the missing values of the training set would have been imputed using cases from the test set, which violates a crucial concept in machine learning model evaluation, namely that the test set should never be seen by the model during the training process.&lt;/p&gt;
&lt;p&gt;To avoid this problem we apply the imputation separately to the training set and to the testing set.
So let’s partition the data using the &lt;strong&gt;caret&lt;/strong&gt; package.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;partition-the-data-impute-the-missing-values.&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Partition the data &amp;amp; impute the missing values.&lt;/h2&gt;
&lt;p&gt;We randomly split the data into two sets: 80% of the samples will be used in the training process and the remaining 20% will be kept as the test set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
index&amp;lt;-createDataPartition(mydata$Survived,p=0.8,list=FALSE)
train&amp;lt;-mydata[index,]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: The `i` argument of ``[`()` can&amp;#39;t be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test&amp;lt;-mydata[-index,]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we are ready to impute the missing values for both the train and test sets.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh(library(mice))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;mice&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;impute_train&amp;lt;-mice(train,m=1,seed = 1111)
train&amp;lt;-complete(impute_train,1)

impute_test&amp;lt;-mice(test,m=1,seed = 1111)
test&amp;lt;-complete(impute_test,1)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;convert-the-data-into-a-numeric-matrix.&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Convert the data into a numeric matrix.&lt;/h3&gt;
&lt;p&gt;In deep learning all the variables should be of numeric type, so we first convert the factors to integer type and recode the levels to start from 0, then we convert the data into a matrix, and finally we pull out the target variable into a separate vector.
We do this transformation for both sets (train and test).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train$Embarked&amp;lt;-as.integer(train$Embarked)-1
train$Sex&amp;lt;-as.integer(train$Sex)-1
train$Pclass&amp;lt;-as.integer(train$Pclass)-1

test$Embarked&amp;lt;-as.integer(test$Embarked)-1
test$Sex&amp;lt;-as.integer(test$Sex)-1
test$Pclass&amp;lt;-as.integer(test$Pclass)-1
glimpse(test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 178
## Columns: 8
## $ Survived &amp;lt;dbl&amp;gt; 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0...
## $ Pclass   &amp;lt;dbl&amp;gt; 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 0, 0, 2, 2, 2, 1, 2, 2, 2...
## $ Sex      &amp;lt;dbl&amp;gt; 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1...
## $ Age      &amp;lt;dbl&amp;gt; 35.0, 2.0, 27.0, 55.0, 38.0, 23.0, 38.0, 3.0, 28.0, 34.5, ...
## $ SibSp    &amp;lt;dbl&amp;gt; 0, 3, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 4, 5, 0, 0, 0, 0...
## $ Parch    &amp;lt;dbl&amp;gt; 0, 1, 2, 0, 0, 0, 5, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0...
## $ Fare     &amp;lt;dbl&amp;gt; 8.0500, 21.0750, 11.1333, 16.0000, 13.0000, 7.2250, 31.387...
## $ Embarked &amp;lt;dbl&amp;gt; 2, 2, 2, 2, 2, 0, 2, 0, 2, 1, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: We converted the variables &lt;strong&gt;Pclass&lt;/strong&gt;, &lt;strong&gt;Embarked&lt;/strong&gt;, and &lt;strong&gt;Sex&lt;/strong&gt; to factors so that the imputation step would treat them appropriately; had we not done so, the imputed values of &lt;strong&gt;Embarked&lt;/strong&gt;, for instance, could have been arbitrary numeric values that do not correspond to any port in the data.&lt;/p&gt;
&lt;p&gt;We convert the two sets into matrix form (and also remove the column names).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trained&amp;lt;-as.matrix(train)
dimnames(trained)&amp;lt;-NULL

tested&amp;lt;-as.matrix(test)
dimnames(tested)&amp;lt;-NULL
str(tested)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  num [1:178, 1:8] 0 0 1 1 1 1 1 1 0 0 ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we pull out the target variable&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trainy&amp;lt;-trained[,1]
testy&amp;lt;-tested[,1]
trainx&amp;lt;-trained[,-1]
testx&amp;lt;-tested[,-1]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we apply one-hot encoding to the target variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trainlabel&amp;lt;-to_categorical(trainy)
testlabel&amp;lt;-to_categorical(testy)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;train-the-model.&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Train the model.&lt;/h2&gt;
&lt;p&gt;Now it is time to build our model. The first step is to define the model architecture and the number of layers that will be used, with their parameters.
We will choose a simple model with one hidden layer of 10 units (nodes). Since we have 7 predictors, the input_shape will be 7; the activation function of the hidden layer is &lt;strong&gt;relu&lt;/strong&gt;, the most widely used one, while for the output layer we choose the sigmoid function since we have a binary classification problem.&lt;/p&gt;
&lt;div id=&#34;create-the-model&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Create the model&lt;/h3&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model &amp;lt;- keras_model_sequential()

model %&amp;gt;%
    layer_dense(units=10,activation = &amp;quot;relu&amp;quot;,
              kernel_initializer = &amp;quot;he_normal&amp;quot;,input_shape =c(7))%&amp;gt;%
    layer_dense(units=2,activation = &amp;quot;sigmoid&amp;quot;)

summary(model)  &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Model: &amp;quot;sequential&amp;quot;
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense (Dense)                       (None, 10)                      80          
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 2)                       22          
## ================================================================================
## Total params: 102
## Trainable params: 102
## Non-trainable params: 0
## ________________________________________________________________________________&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We have 102 parameters to estimate in total. The hidden layer has 7 inputs feeding 10 nodes, plus 10 biases, giving 80 parameters (7*10+10). The 22 parameters of the output layer are obtained the same way (10*2+2).&lt;/p&gt;
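&lt;p&gt;The parameter counts can be checked with simple arithmetic, using the layer sizes of our model:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# hidden layer: 7 inputs times 10 units, plus 10 biases
hidden_params &amp;lt;- 7 * 10 + 10    # 80
# output layer: 10 hidden units times 2 output units, plus 2 biases
output_params &amp;lt;- 10 * 2 + 2     # 22
hidden_params + output_params    # 102, matching summary(model)&lt;/code&gt;&lt;/pre&gt;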
&lt;/div&gt;
&lt;div id=&#34;compile-the-model&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Compile the model&lt;/h3&gt;
&lt;p&gt;In the &lt;strong&gt;compile&lt;/strong&gt; function (from keras) we specify the loss function, the optimizer, and the metric that will be used. In our case we use &lt;strong&gt;binary crossentropy&lt;/strong&gt; as the loss, the popular &lt;strong&gt;adam&lt;/strong&gt; optimizer, and &lt;strong&gt;accuracy&lt;/strong&gt; as the metric.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model %&amp;gt;%
  compile(loss=&amp;quot;binary_crossentropy&amp;quot;,
          optimizer=&amp;quot;adam&amp;quot;,
          metric=&amp;quot;accuracy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;fit-the-model&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Fit the model&lt;/h3&gt;
&lt;p&gt;Now we can run our model and follow the dynamic evolution of the training process in the plot window in the lower right corner of the screen (you can also plot the history in a static way afterwards).
For our model we choose 100 epochs (iterations), the stochastic gradient uses batches of 20 samples at each iteration, and we hold out 20% of the training data to assess the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#history&amp;lt;- model %&amp;gt;%
# fit(trainx,trainlabel,epoch=100,batch_size=20,validation_split=0.2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: if you would like to rerun the model, uncomment the above code.&lt;/p&gt;
&lt;p&gt;We can extract the last five metric values from the history object as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#df &amp;lt;- tibble(train_loss=history$metrics$loss, valid_loss=history$metrics$val_loss,
#      train_acc=history$metrics$accuracy, valid_acc=history$metrics$val_accuracy)
#write_csv(df,&amp;quot;df.csv&amp;quot;)
df &amp;lt;- read.csv(&amp;quot;df.csv&amp;quot;)
tail(df,5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     train_loss valid_loss train_acc valid_acc
## 96   0.4600244  0.4038978 0.7850877 0.8146853
## 97   0.4655294  0.4080083 0.7850877 0.8181818
## 98   0.4616975  0.4048636 0.7894737 0.8286713
## 99   0.4634421  0.4092717 0.7929825 0.8216783
## 100  0.4639769  0.4116935 0.7789474 0.8216783&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It should be noted that since the accuracy curves stay close to each other and move in the same direction, we do not have to worry about overfitting. The opposite, however, is more pronounced: the accuracy on the training samples is lower than that on the validation samples (underfitting), so we should increase the complexity of the model (by adding more nodes or more layers).&lt;/p&gt;
&lt;p&gt;We can save this model (or only its weights) and load it again for further use.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#save_model_hdf5(model,&amp;quot;simplemodel.h5&amp;quot;)
model&amp;lt;-load_model_hdf5(&amp;quot;simplemodel.h5&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;the-model-evaluation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The model evaluation&lt;/h2&gt;
&lt;p&gt;Let’s evaluate our model using the training set and then the testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_eva &amp;lt;- model %&amp;gt;%
  evaluate(trainx,trainlabel)
test_eva &amp;lt;- model %&amp;gt;% 
  evaluate(testx, testlabel) 
tibble(train_acc= train_eva[[&amp;quot;accuracy&amp;quot;]], test_acc= test_eva[[&amp;quot;accuracy&amp;quot;]], train_loss=train_eva[[&amp;quot;loss&amp;quot;]],test_loss=test_eva[[&amp;quot;loss&amp;quot;]])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The accuracy rate of the model on the test set is 80.89%, which is higher than that on the training set (79.92%); this confirms that the model is underfitting and needs more improvement.&lt;/p&gt;
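&lt;p&gt;Since we loaded &lt;strong&gt;caret&lt;/strong&gt; earlier partly for its confusion matrix, the test-set predictions could also be inspected class by class. This is a sketch: &lt;code&gt;predict&lt;/code&gt; on a keras model returns a matrix of class probabilities (one column per class here, because of the one-hot encoded target), which we turn into 0/1 labels by taking the most probable column.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prob &amp;lt;- predict(model, testx)   # matrix of class probabilities
pred &amp;lt;- max.col(prob) - 1       # most probable class, recoded to 0/1
confusionMatrix(factor(pred), factor(testy))&lt;/code&gt;&lt;/pre&gt;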
&lt;/div&gt;
&lt;div id=&#34;model-tuning&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Model tuning&lt;/h2&gt;
&lt;p&gt;Let’s now include another hidden layer with 20 nodes, and also increase the number of epochs to 200. In addition, as we did with the previous model, we should save our optimal model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model1 &amp;lt;- keras_model_sequential()

model1 %&amp;gt;%
    layer_dense(units=10,activation = &amp;quot;relu&amp;quot;,
              kernel_initializer = &amp;quot;he_normal&amp;quot;,input_shape =c(7)) %&amp;gt;%
    layer_dense(units=20, activation = &amp;quot;relu&amp;quot;,
              kernel_initializer = &amp;quot;he_normal&amp;quot;) %&amp;gt;%
    layer_dense(units=2,activation = &amp;quot;sigmoid&amp;quot;)

model1 %&amp;gt;%
  compile(loss=&amp;quot;binary_crossentropy&amp;quot;,
          optimizer=&amp;quot;adam&amp;quot;,
          metric=&amp;quot;accuracy&amp;quot;)

#history1&amp;lt;- model1 %&amp;gt;%
#   fit (trainx,trainlabel,epoch=200,batch_size=40,validation_split=0.2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before evaluating, we save the model and load it back.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#save_model_hdf5(model,&amp;quot;simplemodel1.h5&amp;quot;)
model1&amp;lt;-load_model_hdf5(&amp;quot;simplemodel1.h5&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s evaluate this new model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_eva &amp;lt;- model1 %&amp;gt;%
  evaluate(trainx,trainlabel)
test_eva &amp;lt;- model1 %&amp;gt;% 
  evaluate(testx, testlabel)
tibble(train_acc= train_eva[[&amp;quot;accuracy&amp;quot;]], test_acc= test_eva[[&amp;quot;accuracy&amp;quot;]], train_loss=train_eva[[&amp;quot;loss&amp;quot;]],test_loss=test_eva[[&amp;quot;loss&amp;quot;]])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this new model we get a noticeable improvement in both accuracies. We could go back to the model and try increasing the nodes or the layers, or play around with other parameters, to get better results.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Practically, deep learning models are more efficient than most of the classical machine learning models when it comes to fitting complex and large data sets. Moreover, for some types of data, such as images or speech, deep learning shows its greatest capability.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Time series with ARIMA and RNN models</title>
      <link>https://modelingwithr.rbind.io/courses/rnn/time-series-with-recurrent-neaural-network-rnn-lstm-model/</link>
      <pubDate>Tue, 05 May 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/courses/rnn/time-series-with-recurrent-neaural-network-rnn-lstm-model/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#arima-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; ARIMA model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#rnn-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; RNN model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#reshape-the-time-series&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.0.1&lt;/span&gt; Reshape the time series&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-architecture&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.1&lt;/span&gt; Model architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-training&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.2&lt;/span&gt; Model training&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#prediction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.3&lt;/span&gt; Prediction&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#results-comparison&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Results comparison&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#further-reading&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Further reading&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-info&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8&lt;/span&gt; Session info&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;The classical methods for predicting univariate time series are &lt;a href=&#34;https://otexts.com/fpp2/arima.html&#34;&gt;ARIMA&lt;/a&gt; models (under the linearity assumption, and provided that the non-stationarity is of the difference-stationary (DS) type), which use the autocorrelations (up to some order) to predict the target variable as a linear function of its own past values (the autoregressive part) and the past values of the errors (the moving average part). However, the hardest step in ARIMA modeling is to derive a stationary series from a non-stationary one whose trend (deterministic or stochastic) or seasonality is not well defined. The RNN model, proposed by John Hopfield (1982), is a deep learning model that does not need the above requirements (linearity and a particular type of non-stationarity) and can capture and model the memory of the time series, which is the main characteristic of several types of sequence data besides time series, such as &lt;strong&gt;text data&lt;/strong&gt;, &lt;strong&gt;image captioning&lt;/strong&gt;, and &lt;strong&gt;speech recognition&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The basic idea behind the RNN is very simple (as described in the plot below). At each time step &lt;strong&gt;t&lt;/strong&gt; the model computes a state value &lt;span class=&#34;math inline&#34;&gt;\(h_t\)&lt;/span&gt; that combines, in a linear combination, the previous state &lt;span class=&#34;math inline&#34;&gt;\(h_{t-1}\)&lt;/span&gt; (which contains all the memory available at time &lt;strong&gt;t-1&lt;/strong&gt;) and the current input &lt;span class=&#34;math inline&#34;&gt;\(x_t\)&lt;/span&gt; (the current value of the time series), then passes the result through the activation function &lt;strong&gt;tanh&lt;/strong&gt; (to capture any nonlinear relations). The state at each time step t can thus formally be expressed as follows:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[h_t=tanh(W_h.h_{t-1}+W_x.x_t+b)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;We then leave it to gradient descent to decide how much memory to keep by computing the optimal weights &lt;span class=&#34;math inline&#34;&gt;\(W_h\)&lt;/span&gt;.
Similarly, the output &lt;span class=&#34;math inline&#34;&gt;\(y_t\)&lt;/span&gt; is computed as:
&lt;span class=&#34;math display&#34;&gt;\[y_t=W_y.h_t\]&lt;/span&gt;&lt;/p&gt;
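&lt;p&gt;A single pass of this recurrence can be sketched in plain R; the weight values below are arbitrary illustrative numbers, not fitted parameters:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# one RNN step: combine the previous state and the current input, then apply tanh
rnn_step &amp;lt;- function(h_prev, x_t, W_h = 0.5, W_x = 0.8, b = 0.1) {
  tanh(W_h * h_prev + W_x * x_t + b)
}
h &amp;lt;- 0                                # initial state
for (x_t in c(1.43, 1.44, 1.42)) h &amp;lt;- rnn_step(h, x_t)
y &amp;lt;- 0.9 * h                          # output: y_t = W_y * h_t&lt;/code&gt;&lt;/pre&gt;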
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;img &amp;lt;- EBImage::readImage(&amp;quot;C://Users/dell/Documents/new-blog/content/courses/rnn/rnn_plot.jpg&amp;quot;)
plot(img)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/rnn/2020-05-05-time-series-with-recurrent-neaural-network-rnn-lstm-model_files/figure-html/unnamed-chunk-2-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;First let’s call the packages needed for our analysis&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh &amp;lt;- suppressPackageStartupMessages
ssh(library(timeSeries))
ssh(library(tseries))
ssh(library(aTSA))
ssh(library(forecast))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;forecast&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh(library(rugarch))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;rugarch&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh(library(ModelMetrics))
ssh(library(keras))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this article we will use the data &lt;strong&gt;USDCHF&lt;/strong&gt; from the &lt;strong&gt;timeSeries&lt;/strong&gt; package, which is the univariate time series of intraday foreign exchange rates between the US dollar and the Swiss franc, with &lt;strong&gt;62496&lt;/strong&gt; observations.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(USDCHF)
length(USDCHF)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 62496&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s look at this data with the following plot, after converting it to a ts object.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(USDCHF)
data &amp;lt;- ts(USDCHF, frequency = 365)
plot(data)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/rnn/2020-05-05-time-series-with-recurrent-neaural-network-rnn-lstm-model_files/figure-html/unnamed-chunk-5-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This series seems to have a trend and is not stationary, but let’s verify this with the &lt;a href=&#34;https://faculty.washington.edu/ezivot/econ584/notes/unitroot.pdf&#34;&gt;Dickey-Fuller&lt;/a&gt; and &lt;a href=&#34;https://faculty.washington.edu/ezivot/econ584/notes/unitroot.pdf&#34;&gt;Phillips-Perron&lt;/a&gt; tests.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;adf.test(data)
pp.test(data)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both tests confirm that the data has a unit root (high p-value: we do not reject the null hypothesis). We can also check the correlogram of the autocorrelation function
&lt;a href=&#34;https://towardsdatascience.com/significance-of-acf-and-pacf-plots-in-time-series-analysis-2fa11a5d10a8&#34;&gt;acf&lt;/a&gt; and the partial autocorrelation function &lt;a href=&#34;https://towardsdatascience.com/significance-of-acf-and-pacf-plots-in-time-series-analysis-2fa11a5d10a8&#34;&gt;pacf&lt;/a&gt; as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;acf(data)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/rnn/2020-05-05-time-series-with-recurrent-neaural-network-rnn-lstm-model_files/figure-html/unnamed-chunk-7-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pacf(data)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/rnn/2020-05-05-time-series-with-recurrent-neaural-network-rnn-lstm-model_files/figure-html/unnamed-chunk-7-2.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As you know, the ACF relates to the MA part and the PACF to the AR part. Since the PACF shows one bar that far exceeds the confidence interval, we can be confident that our data has a unit root, and we can get rid of it by differencing the data once. In ARIMA terms the data should be integrated of order 1 (d=1); this is the &lt;strong&gt;I&lt;/strong&gt; part of ARIMA. In addition, since the PACF bars do not decay gradually, the model would not include any lags in the AR part.
In contrast, in the ACF plot all the bars lie far outside the confidence interval, so the model would include many MA lags.&lt;/p&gt;
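Differencing once simply replaces each value with its change from the previous one (in R this is what `diff()` does, as used later in this article). As a toy illustration, sketched here in Python, a linearly trending series becomes constant after one difference:

```python
def diff1(series):
    """First difference: y_t - y_{t-1}; removes a unit root / linear trend."""
    return [series[i] - series[i - 1] for i in range(1, len(series))]

trend = [1.0, 1.2, 1.4, 1.6, 1.8]   # non-stationary: steady upward trend
deltas = diff1(trend)                # constant changes: stationary
```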
&lt;/div&gt;
&lt;div id=&#34;arima-model&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; ARIMA model&lt;/h1&gt;
&lt;p&gt;To fit an ARIMA model we have to determine the lag of the AR (p) and MA (q) components and how many times to difference the series to make it stationary (d). Fortunately, we do not have to worry about these issues; we leave everything to the &lt;strong&gt;forecast&lt;/strong&gt; package, which provides a fast way to get the best model via the function &lt;strong&gt;auto.arima&lt;/strong&gt;. But before that, let’s hold out the last 100 observations as testing data in order to compare the quality of this model with that of the RNN model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data_test &amp;lt;- data[(length(data)-99):length(data)]
data_train &amp;lt;- data[1:(length(data)-99-1)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_arima &amp;lt;- auto.arima(data_train)
summary(model_arima)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Series: data_train 
## ARIMA(0,1,2) with drift 
## 
## Coefficients:
##           ma1     ma2  drift
##       -0.0193  0.0113      0
## s.e.   0.0040  0.0040      0
## 
## sigma^2 estimated as 2.29e-06:  log likelihood=316634.5
## AIC=-633260.9   AICc=-633260.9   BIC=-633224.8
## 
## Training set error measures:
##                        ME        RMSE          MAE           MPE       MAPE
## Training set 1.900607e-08 0.001513064 0.0009922846 -3.671242e-05 0.06627114
##                  MASE          ACF1
## Training set 0.999585 -3.921999e-05&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As expected, this model is an ARIMA(0,1,2): integrated of order 1 (the differenced series is stationary), with two MA lags and a &lt;strong&gt;drift&lt;/strong&gt; (constant) term estimated at essentially zero. The output also reports some metric values, such as the root mean square error &lt;strong&gt;RMSE&lt;/strong&gt; and the mean absolute error &lt;strong&gt;MAE&lt;/strong&gt;, which are the most popular ones. We will use these metrics later to compare this model with the RNN model.
To validate this model we have to make sure that the residuals are white noise, without problems such as autocorrelation or &lt;a href=&#34;https://www.investopedia.com/terms/h/heteroskedasticity.asp&#34;&gt;heteroskedasticity&lt;/a&gt;. Thanks to the &lt;strong&gt;forecast&lt;/strong&gt; package we can check the residuals straightforwardly by calling the function &lt;strong&gt;checkresiduals&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;checkresiduals(model_arima)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/rnn/2020-05-05-time-series-with-recurrent-neaural-network-rnn-lstm-model_files/figure-html/unnamed-chunk-10-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2) with drift
## Q* = 8.6631, df = 7, p-value = 0.2778
## 
## Model df: 3.   Total lags used: 10&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the p-value is far larger than the 5% significance level, we do not reject the null hypothesis that the errors are not autocorrelated. Looking at the ACF plot, some bars do go outside the confidence interval, but this is to be expected at the 5% significance level (as false positives). So we can confirm the absence of correlation with 95% confidence.
For possible heteroskedasticity we use the &lt;a href=&#34;https://hal.archives-ouvertes.fr/hal-00588680/document&#34;&gt;ARCH-LM&lt;/a&gt; statistic from the &lt;strong&gt;aTSA&lt;/strong&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;arch.test(arima(data_train, order = c(0,1,2)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/rnn/2020-05-05-time-series-with-recurrent-neaural-network-rnn-lstm-model_files/figure-html/unnamed-chunk-11-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We see that both tests are highly significant (we reject the null hypothesis of homoskedasticity), so the above ARIMA model is not able to capture this pattern. That is why we should pair the above model with another model that keeps track of this type of pattern, called a &lt;a href=&#34;https://medium.com/auquan/time-series-analysis-for-finance-arch-garch-models-822f87f1d755&#34;&gt;GARCH&lt;/a&gt; model.
The GARCH model attempts to model the residuals of the ARIMA model with the following general formula:
&lt;span class=&#34;math display&#34;&gt;\[\epsilon_t=w_t\sqrt{h_t}\]&lt;/span&gt;
&lt;span class=&#34;math display&#34;&gt;\[h_t=a_0+\sum_{i=1}^{p}a_i\epsilon_{t-i}^2+\sum_{j=1}^{q}b_j h_{t-j}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Where &lt;span class=&#34;math inline&#34;&gt;\(w_t\)&lt;/span&gt; is white noise error.&lt;/p&gt;
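A single variance update under this model can be sketched as follows (Python; the coefficients a0, a1 and b1 are made-up values for illustration only — in practice they are estimated from the data, as done below):

```python
def garch11_update(h_prev, eps_prev, a0, a1, b1):
    """GARCH(1,1) conditional variance: h_t = a0 + a1*eps_{t-1}^2 + b1*h_{t-1}."""
    return a0 + a1 * eps_prev ** 2 + b1 * h_prev

# Illustrative parameters; a1 + b1 below 1 keeps the variance process stable.
a0, a1, b1 = 0.0001, 0.1, 0.85

h = a0 / (1 - a1 - b1)           # start from the unconditional variance
for eps in [0.01, -0.03, 0.02]:  # a toy sequence of past residuals
    h = garch11_update(h, eps, a0, a1, b1)
```

A large residual at time t-1 inflates the conditional variance at time t, which is how the model captures volatility clustering.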
&lt;p&gt;So we fit this model for different lags by calling the function &lt;strong&gt;garch&lt;/strong&gt; from the package &lt;strong&gt;tseries&lt;/strong&gt;, and we use the &lt;strong&gt;AIC&lt;/strong&gt; criterion to get the best model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model &amp;lt;- character()
AIC &amp;lt;- numeric()
for (p in 1:5){
  for(q in 1:5){
    model_g &amp;lt;- tseries::garch(model_arima$residuals, order = c(p,q), trace=F)
    model&amp;lt;-c(model,paste(&amp;quot;mod_&amp;quot;, p, q))
    AIC &amp;lt;- c(AIC, AIC(model_g))
    def &amp;lt;- tibble::tibble(model,AIC)
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning in tseries::garch(model_arima$residuals, order = c(p, q), trace = F):
## singular information
## (the same warning is repeated for each of the 25 fitted models)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;def %&amp;gt;% dplyr::arrange(AIC)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 25 x 2
##    model         AIC
##    &amp;lt;chr&amp;gt;       &amp;lt;dbl&amp;gt;
##  1 mod_ 1 1 -647018.
##  2 mod_ 2 1 -647005.
##  3 mod_ 1 2 -647005.
##  4 mod_ 2 3 -646986.
##  5 mod_ 1 3 -646971.
##  6 mod_ 1 4 -646967.
##  7 mod_ 2 2 -646900.
##  8 mod_ 3 3 -646885.
##  9 mod_ 3 1 -646859.
## 10 mod_ 1 5 -646859.
## # ... with 15 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, the simplest model, with one lag for each component, fits the residuals well.
We can check the residuals of this model with a Box test.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_garch &amp;lt;- tseries::garch(model_arima$residuals, order = c(1,1), trace=F)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning in tseries::garch(model_arima$residuals, order = c(1, 1), trace = F):
## singular information&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Box.test(model_garch$residuals)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##  Box-Pierce test
## 
## data:  model_garch$residuals
## X-squared = 3.1269, df = 1, p-value = 0.07701&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At the 5% significance level we do not reject the null hypothesis of independence.
As an alternative, we can inspect the ACF of the residuals.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;acf(model_garch$residuals[-1])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/rnn/2020-05-05-time-series-with-recurrent-neaural-network-rnn-lstm-model_files/figure-html/unnamed-chunk-14-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The easiest way to get prediction from our model is by making use of the &lt;strong&gt;rugarch&lt;/strong&gt; package. First, we specify the model with the parameters obtained above (the different lags)&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# garch1 &amp;lt;- ugarchspec(mean.model = list(armaOrder = c(0,2), include.mean = FALSE), 
# variance.model = list(garchOrder = c(1,1))) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we use the function &lt;strong&gt;ugarchfit&lt;/strong&gt; to fit our data_train. However, you might have noticed that we supplied only the lags of the AR and MA parts of our ARIMA model (the d value for integration is not available in this function), so we should provide the differenced series of &lt;strong&gt;data_train&lt;/strong&gt; instead of the original series.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Ddata_train &amp;lt;- diff(data_train)
# garchfit &amp;lt;- ugarchfit(data=Ddata_train, spec = garch1, solver = &amp;quot;gosolnp&amp;quot;,trace=F)
# coef(garchfit)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our final model will be written as follows.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[y_t=e_t-4.296\times 10^{-2}e_{t-1}+5.687\times 10^{-3}e_{t-2} \\
e_t\sim N(0,\hat\sigma_t^2) \\
\hat\sigma_t^2=1.950\times 10^{-7}+2.565\times 10^{-1}e_{t-1}^2+6.940\times 10^{-1}\hat\sigma_{t-1}^2\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;: each run of the above model gives different results due to the internal randomization process; that is why I commented out the above code, to prevent it from being rerun when rendering this document.&lt;/p&gt;
&lt;p&gt;Now we use this model to forecast 100 future values, which will then be compared with the data_test values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# fitted &amp;lt;- ugarchforecast(garchfit, n.ahead = 100)
# yh_test &amp;lt;- numeric(100)
# # anchor the first forecast level at the last training observation
# yh_test[1] &amp;lt;- data_train[length(data_train)] + fitted(fitted)[1]
# for (i in 2:100){
#   yh_test[i] &amp;lt;- yh_test[i-1] + fitted(fitted)[i]
# }
# df_eval &amp;lt;- tibble::tibble(y_test = data_test, yh_test = yh_test)
# df_eval&lt;/code&gt;&lt;/pre&gt;
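The logic of the commented loop — each forecast level is the previous level plus the corresponding differenced forecast — amounts to a cumulative sum anchored at the last training observation. A toy sketch (Python, with made-up numbers):

```python
def undiff(last_level, diffs):
    """Rebuild levels from differenced forecasts:
    level[0] = last_level + diffs[0]; level[i] = level[i-1] + diffs[i]."""
    levels = []
    level = last_level
    for d in diffs:
        level = level + d
        levels.append(level)
    return levels

# Toy numbers: last observed rate 1.5, three differenced forecasts.
forecast_levels = undiff(1.5, [0.001, -0.002, 0.0005])
```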
&lt;p&gt;Finally we should save the &lt;strong&gt;df_eval&lt;/strong&gt; table with the original and the fitted values of the data_test for further use.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#write.csv(df_eval, &amp;quot;df_eval.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;rnn-model&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; RNN model&lt;/h1&gt;
&lt;p&gt;As an alternative to the ARIMA prediction method discussed above, the deep learning RNN method can also take into account the memory of the time series. Unlike classical feedforward networks, which process each single input independently, the RNN takes a bunch of inputs that are supposed to form one sequence and processes them together, as shown in the first plot. In keras this step is provided by &lt;strong&gt;layer_simple_rnn&lt;/strong&gt; (Chollet, 2017, p. 167).
This means we have to decide the length of the sequence, in other words how far back we think the current value depends on (the memory of the time series). In our case we assume that the last 7 values should be satisfactory to predict the current value.&lt;/p&gt;
&lt;div id=&#34;reshape-the-time-series&#34; class=&#34;section level3&#34; number=&#34;4.0.1&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.0.1&lt;/span&gt; Reshape the time series&lt;/h3&gt;
&lt;p&gt;The first thing we do is organize the data in such a way that the model knows which part is considered as sequences to be processed by the RNN layer, and which part is the target variable. To do so we reorganize the time series into a matrix where each row is a single input, the columns contain the lagged values (of the target variable) up to 7, and the target variable sits in the last column. Consequently, the total number of rows will be &lt;strong&gt;length(data)-maxlen-1&lt;/strong&gt;, where maxlen refers to the (constant) length of each sequence, here equal to 7.&lt;/p&gt;
&lt;p&gt;Let’s first create an empty matrix&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;maxlen &amp;lt;- 7
exch_matrix&amp;lt;- matrix(0, nrow = length(data_train)-maxlen-1, ncol = maxlen+1) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s move our time series into this matrix and display some rows to make sure the output is as expected.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;for(i in 1:(length(data_train)-maxlen-1)){
  exch_matrix[i,] &amp;lt;- data_train[i:(i+maxlen)]
}
head(exch_matrix)  &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]
## [1,] 1.1930 1.1941 1.1933 1.1931 1.1924 1.1926 1.1926 1.1932
## [2,] 1.1941 1.1933 1.1931 1.1924 1.1926 1.1926 1.1932 1.1933
## [3,] 1.1933 1.1931 1.1924 1.1926 1.1926 1.1932 1.1933 1.1932
## [4,] 1.1931 1.1924 1.1926 1.1926 1.1932 1.1933 1.1932 1.1933
## [5,] 1.1924 1.1926 1.1926 1.1932 1.1933 1.1932 1.1933 1.1934
## [6,] 1.1926 1.1926 1.1932 1.1933 1.1932 1.1933 1.1934 1.1940&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we separate the inputs from the target.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;x_train &amp;lt;- exch_matrix[, -ncol(exch_matrix)]
y_train &amp;lt;- exch_matrix[, ncol(exch_matrix)]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The RNN layer in keras expects the inputs to have the shape (examples, maxlen, number of features). Since we have only one feature (our single time series, processed sequentially), the shape of the inputs should be c(examples, 7, 1). However, the first dimension can be discarded and we can provide only the last two.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dim(x_train)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 62388     7&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see this shape does not include the number of features, so we can correct it as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;x_train &amp;lt;- array_reshape(x_train, dim = c((length(data_train)-maxlen-1), maxlen, 1))
dim(x_train)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 62388     7     1&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;model-architecture&#34; class=&#34;section level2&#34; number=&#34;4.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.1&lt;/span&gt; Model architecture&lt;/h2&gt;
&lt;p&gt;When it comes to deep learning models there is a large space of hyperparameters to be defined, and the results depend heavily on them: the optimal number of layers, the optimal number of nodes in each layer, the suitable activation function, the suitable loss function, the best optimizer, the best regularization techniques, the best random initialization, etc. Unfortunately, we do not yet have an exact rule for deciding these hyperparameters; they depend on the problem under study, the data at hand, and the experience of the modeler. In our case, for instance, the data is very simple and does not really require a complex architecture, so we will use only one hidden RNN layer with 10 nodes; the loss function will be the mean square error &lt;strong&gt;mse&lt;/strong&gt;, the optimizer will be &lt;strong&gt;adam&lt;/strong&gt;, and the metric will be the mean absolute error &lt;strong&gt;mae&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: with large and complex time series it might be necessary to stack several RNN layers.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model &amp;lt;- keras_model_sequential()
model %&amp;gt;% 
  layer_dense(input_shape = dim(x_train)[-1], units=maxlen) %&amp;gt;% 
  layer_simple_rnn(units=10) %&amp;gt;% 
  layer_dense(units = 1)
summary(model)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Model: &amp;quot;sequential&amp;quot;
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense (Dense)                       (None, 7, 7)                    14          
## ________________________________________________________________________________
## simple_rnn (SimpleRNN)              (None, 10)                      180         
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 1)                       11          
## ================================================================================
## Total params: 205
## Trainable params: 205
## Non-trainable params: 0
## ________________________________________________________________________________&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;model-training&#34; class=&#34;section level2&#34; number=&#34;4.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.2&lt;/span&gt; Model training&lt;/h2&gt;
&lt;p&gt;Now let’s compile and run the model with 5 epochs and a batch_size of 32 instances at a time to update the weights; to keep track of the model performance we hold out 10% of the training data as a validation set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model %&amp;gt;% compile(
  loss = &amp;quot;mse&amp;quot;,
  optimizer= &amp;quot;adam&amp;quot;,
  metric = &amp;quot;mae&amp;quot; 
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#history &amp;lt;- model %&amp;gt;% 
#  fit(x_train, y_train, epochs = 5, batch_size = 32, validation_split=0.1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since each rerun of the model gives different results, we should save the model (or only the model weights) and reload it again; this way, when rendering the document we will not be surprised by different outputs.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#save_model_hdf5(model, &amp;quot;rnn_model.h5&amp;quot;)
rnn_model &amp;lt;- load_model_hdf5(&amp;quot;rnn_model.h5&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;prediction&#34; class=&#34;section level2&#34; number=&#34;4.3&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.3&lt;/span&gt; Prediction&lt;/h2&gt;
&lt;p&gt;In order to get the prediction of the last 100 data points, we will predict the entire data and then compute the &lt;strong&gt;rmse&lt;/strong&gt; for the last 100 predictions.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;maxlen &amp;lt;- 7
exch_matrix2&amp;lt;- matrix(0, nrow = length(data)-maxlen-1, ncol = maxlen+1) 

for(i in 1:(length(data)-maxlen-1)){
  exch_matrix2[i,] &amp;lt;- data[i:(i+maxlen)]
}

x_train2 &amp;lt;- exch_matrix2[, -ncol(exch_matrix2)]
y_train2 &amp;lt;- exch_matrix2[, ncol(exch_matrix2)]

x_train2 &amp;lt;- array_reshape(x_train2, dim = c((length(data)-maxlen-1), maxlen, 1))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- rnn_model %&amp;gt;% predict(x_train2)
df_eval_rnn &amp;lt;- tibble::tibble(y_rnn=y_train2[(length(y_train2)-99):length(y_train2)],
                          yhat_rnn=as.vector(pred)[(length(y_train2)-99):length(y_train2)])&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;results-comparison&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Results comparison&lt;/h1&gt;
&lt;p&gt;We can now compare the prediction of the last 100 data points from this model with the predicted values for the same data points from the ARIMA model. We first load the data predicted with the ARIMA model above and join everything in one data frame; then we compare using two metrics, &lt;strong&gt;rmse&lt;/strong&gt; and &lt;strong&gt;mae&lt;/strong&gt;, which are readily available in the &lt;strong&gt;ModelMetrics&lt;/strong&gt; package.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: You might ask why we only use 100 data points for prediction when, in machine learning, we usually use a larger number, sometimes 20% of the entire data. The answer lies in the nature of ARIMA models, which are short-term prediction models, especially with financial data characterized by high and unstable volatility (which is why we used the GARCH model above).&lt;/p&gt;
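For reference, the two metrics follow standard definitions; a quick Python sketch of what they compute:

```python
import math

def rmse(actual, predicted):
    """Root mean square error: penalizes large errors quadratically."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mae(actual, predicted):
    """Mean absolute error: the average error magnitude."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n
```

Because rmse squares the errors before averaging, a few large misses weigh more heavily there than in mae, which is why the two metrics can rank the models differently.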
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_eval &amp;lt;- read.csv(&amp;quot;df_eval.csv&amp;quot;)
rmse &amp;lt;- c(rmse(df_eval$y_test, df_eval$yh_test), 
          rmse(df_eval_rnn$y_rnn, df_eval_rnn$yhat_rnn) )
mae &amp;lt;- c(mae(df_eval$y_test, df_eval$yh_test), 
          mae(df_eval_rnn$y_rnn, df_eval_rnn$yhat_rnn) )
df &amp;lt;- tibble::tibble(model=c(&amp;quot;ARIMA&amp;quot;, &amp;quot;RNN&amp;quot;), rmse, mae)
df&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 3
##   model    rmse     mae
##   &amp;lt;chr&amp;gt;   &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
## 1 ARIMA 0.00563 0.00388
## 2 RNN   0.00442 0.00401&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, the two models are close to each other. With &lt;strong&gt;rmse&lt;/strong&gt;, the most popular metric for continuous variables, the RNN model is better, while with &lt;strong&gt;mae&lt;/strong&gt; they are approximately the same.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Conclusion&lt;/h1&gt;
&lt;p&gt;Even though this data is very simple, does not need an RNN model, and can be predicted with classical ARIMA models, it is used here for pedagogical purposes: to understand well how the RNN works and how the data should be processed to be ingested by &lt;strong&gt;keras&lt;/strong&gt;. However, the simple RNN model suffers from a major problem when run over long sequences, known as the &lt;strong&gt;vanishing gradient&lt;/strong&gt; and &lt;strong&gt;exploding gradient&lt;/strong&gt; problem. With the former, when using the chain rule to compute the gradients, if the derivatives have small values then multiplying a large number of small values (as many as the length of the sequence) yields very tiny values that make the network slow to train or even untrainable. The opposite happens with the latter: we get very large values and the network never converges.&lt;br /&gt;
Soon I will post an article on multivariate time series implementing the Long Short-Term Memory &lt;strong&gt;LSTM&lt;/strong&gt; model, which is designed to overcome the above problems faced by the simple RNN model.&lt;/p&gt;
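The vanishing/exploding effect is easy to see numerically: multiplying one per-step derivative factor across a long sequence either collapses toward zero or blows up. A toy Python illustration (not an actual backpropagation, just the repeated product the chain rule induces):

```python
def chained_gradient(factor, steps):
    """Product of identical per-step derivative factors across a sequence,
    mimicking the repeated multiplication in the chain rule."""
    grad = 1.0
    for _ in range(steps):
        grad = grad * factor
    return grad

vanishing = chained_gradient(0.5, 50)   # shrinks toward zero
exploding = chained_gradient(1.5, 50)   # grows without bound
```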
&lt;/div&gt;
&lt;div id=&#34;further-reading&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Further reading&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;François Chollet, Deep Learning with R, MEAP edition, 2017, p. 167&lt;/li&gt;
&lt;li&gt;Ian Goodfellow et al., Deep Learning, &lt;a href=&#34;http://www.deeplearningbook.org/&#34; class=&#34;uri&#34;&gt;http://www.deeplearningbook.org/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;session-info&#34; class=&#34;section level1&#34; number=&#34;8&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;8&lt;/span&gt; Session info&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## R version 4.0.1 (2020-06-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] keras_2.3.0.0        ModelMetrics_1.2.2.2 rugarch_1.4-4       
## [4] forecast_8.13        aTSA_3.1.2           tseries_0.10-47     
## [7] timeSeries_3062.100  timeDate_3043.102   
## 
## loaded via a namespace (and not attached):
##  [1] jsonlite_1.7.1              assertthat_0.2.1           
##  [3] TTR_0.24.2                  tiff_0.1-5                 
##  [5] yaml_2.2.1                  GeneralizedHyperbolic_0.8-4
##  [7] numDeriv_2016.8-1.1         pillar_1.4.6               
##  [9] lattice_0.20-41             reticulate_1.16            
## [11] glue_1.4.2                  quadprog_1.5-8             
## [13] DistributionUtils_0.6-0     digest_0.6.25              
## [15] colorspace_1.4-1            htmltools_0.5.0            
## [17] Matrix_1.2-18               pkgconfig_2.0.3            
## [19] bookdown_0.20               purrr_0.3.4                
## [21] fftwtools_0.9-9             mvtnorm_1.1-1              
## [23] scales_1.1.1                whisker_0.4                
## [25] jpeg_0.1-8.1                tibble_3.0.3               
## [27] farver_2.0.3                EBImage_4.30.0             
## [29] generics_0.0.2              ggplot2_3.3.2              
## [31] ellipsis_0.3.1              urca_1.3-0                 
## [33] nnet_7.3-14                 BiocGenerics_0.34.0        
## [35] cli_2.0.2                   quantmod_0.4.17            
## [37] magrittr_1.5                crayon_1.3.4               
## [39] mclust_5.4.6                evaluate_0.14              
## [41] ks_1.11.7                   fansi_0.4.1                
## [43] nlme_3.1-149                MASS_7.3-53                
## [45] xts_0.12.1                  truncnorm_1.0-8            
## [47] blogdown_0.20               tools_4.0.1                
## [49] data.table_1.13.0           lifecycle_0.2.0            
## [51] stringr_1.4.0               munsell_0.5.0              
## [53] locfit_1.5-9.4              compiler_4.0.1             
## [55] SkewHyperbolic_0.4-0        rlang_0.4.7                
## [57] grid_4.0.1                  RCurl_1.98-1.2             
## [59] nloptr_1.2.2.2              rappdirs_0.3.1             
## [61] htmlwidgets_1.5.2           Rsolnp_1.16                
## [63] labeling_0.3                base64enc_0.1-3            
## [65] spd_2.0-1                   bitops_1.0-6               
## [67] rmarkdown_2.4               gtable_0.3.0               
## [69] fracdiff_1.5-1              abind_1.4-5                
## [71] curl_4.3                    R6_2.4.1                   
## [73] tfruns_1.4                  zoo_1.8-8                  
## [75] tensorflow_2.2.0            knitr_1.30                 
## [77] dplyr_1.0.2                 utf8_1.1.4                 
## [79] zeallot_0.1.0               KernSmooth_2.23-17         
## [81] stringi_1.5.3               Rcpp_1.0.5                 
## [83] vctrs_0.3.4                 png_0.1-7                  
## [85] tidyselect_1.1.0            xfun_0.18                  
## [87] lmtest_0.9-38&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Locality Sensitive Hashing Model</title>
      <link>https://modelingwithr.rbind.io/sparklyr/lsh_spark/local-snsitivity-hashing-model/</link>
      <pubDate>Tue, 28 Apr 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/sparklyr/lsh_spark/local-snsitivity-hashing-model/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data Preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#prediction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Prediction&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#similarity-based-on-distance&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.1&lt;/span&gt; Similarity based on distance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#similarity-based-on-the-number-of-nearest-neighbours&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.2&lt;/span&gt; Similarity based on the number of nearest neighbours&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#further-reading&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Further reading&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-information&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Session information&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;This model is an approximate version of the knn model, which is difficult to implement with large data sets. In contrast to the knn model, which looks for the exact nearest neighbours, this model looks for neighbours with high probability. Spark provides two methods to find approximate neighbours, depending on the data type at hand: &lt;strong&gt;Bucketed random projection&lt;/strong&gt; and &lt;strong&gt;MinHash for Jaccard distance&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The first method projects the data into lower-dimensional hashes, in which similar hash values indicate that the associated points (or observations) are close to each other. The mathematical basis of this technique is the following formula:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[h^{\vec x,b}(\vec\upsilon)=\left\lfloor \frac{\vec\upsilon\cdot\vec x+b}{w}\right\rfloor\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Where &lt;span class=&#34;math inline&#34;&gt;\(h\)&lt;/span&gt; is the hashing function, &lt;span class=&#34;math inline&#34;&gt;\(\vec\upsilon\)&lt;/span&gt; is the feature vector, &lt;span class=&#34;math inline&#34;&gt;\(\vec x\)&lt;/span&gt; is a standard normal vector of the same length, &lt;span class=&#34;math inline&#34;&gt;\(b\)&lt;/span&gt; is a random offset, and &lt;span class=&#34;math inline&#34;&gt;\(w\)&lt;/span&gt; is the bin width of the hashing bins; the symbol &lt;span class=&#34;math inline&#34;&gt;\(\lfloor \rfloor\)&lt;/span&gt; coerces the result to an integer value. The idea is simple: we take the dot product of each feature vector with a random vector, and the resulting projections (which are random) are grouped into buckets; these buckets are supposed to contain similar points. This process can be repeated many times, with a different random vector each time, to refine the similarity. For more detail about this technique click &lt;a href=&#34;https://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
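&lt;p&gt;As a rough illustration of this idea, here is a minimal sketch in plain R (independent of Spark, with made-up vectors): two nearby points tend to fall into the same bucket.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# minimal sketch of the random-projection hash, for intuition only
set.seed(1)
v1 &amp;lt;- c(1, 2, 3)      # two nearby feature vectors
v2 &amp;lt;- c(1.1, 2.1, 2.9)
x &amp;lt;- rnorm(3)          # random standard normal vector
w &amp;lt;- 2                 # bin width
h &amp;lt;- function(v) floor(sum(v * x)/w)
c(h(v1), h(v2))         # nearby points usually share a bucket&lt;/code&gt;&lt;/pre&gt;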
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data Preparation&lt;/h1&gt;
&lt;p&gt;For those who do not know much about sparklyr, check my article &lt;a href=&#34;https://modelingwithr.rbind.io/sparklyr/sparklyr/&#34;&gt;introduction to sparklyr&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;First let’s load the sparklyr and tidyverse packages, then set up the connection to Spark and read in the Titanic data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(sparklyr, warn.conflicts = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;sparklyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse, warn.conflicts = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;ggplot2&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;tibble&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;tidyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;dplyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sc &amp;lt;- spark_connect(master = &amp;quot;local&amp;quot;)
mydata &amp;lt;- spark_read_csv(sc, &amp;quot;titanic&amp;quot;, path = &amp;quot;C://Users/dell/Documents/new-blog/content/post/train.csv&amp;quot;)
glimpse(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Rows: ??
Columns: 12
Database: spark_connection
$ PassengerId &amp;lt;int&amp;gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
$ Survived    &amp;lt;int&amp;gt; 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
$ Pclass      &amp;lt;int&amp;gt; 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
$ Name        &amp;lt;chr&amp;gt; &amp;quot;Braund, Mr. Owen Harris&amp;quot;, &amp;quot;Cumings, Mrs. John Bradley ...
$ Sex         &amp;lt;chr&amp;gt; &amp;quot;male&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;male&amp;quot;, &amp;quot;male&amp;quot;, &amp;quot;...
$ Age         &amp;lt;dbl&amp;gt; 22, 38, 26, 35, 35, NaN, 54, 2, 27, 14, 4, 58, 20, 39, ...
$ SibSp       &amp;lt;int&amp;gt; 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
$ Parch       &amp;lt;int&amp;gt; 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
$ Ticket      &amp;lt;chr&amp;gt; &amp;quot;A/5 21171&amp;quot;, &amp;quot;PC 17599&amp;quot;, &amp;quot;STON/O2. 3101282&amp;quot;, &amp;quot;113803&amp;quot;, ...
$ Fare        &amp;lt;dbl&amp;gt; 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
$ Cabin       &amp;lt;chr&amp;gt; NA, &amp;quot;C85&amp;quot;, NA, &amp;quot;C123&amp;quot;, NA, NA, &amp;quot;E46&amp;quot;, NA, NA, NA, &amp;quot;G6&amp;quot;,...
$ Embarked    &amp;lt;chr&amp;gt; &amp;quot;S&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;Q&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;S&amp;quot;, ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you may notice, this data set is not large, but we intentionally chose it for its familiarity and simplicity, which make understanding the implementation of this model much easier. In other words, to implement this model with a very large data set we would repeat the same general steps.&lt;/p&gt;
&lt;p&gt;Then we remove some variables that we think are not very relevant for our purpose, keeping the &lt;strong&gt;PassengerId&lt;/strong&gt; variable because we need it later (though we give it a shorter name).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;newdata &amp;lt;- mydata %&amp;gt;% select(c(1, 2, 3, 5, 6, 7, 8, 10, 12)) %&amp;gt;% rename(id = PassengerId) %&amp;gt;% 
    glimpse()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Rows: ??
Columns: 9
Database: spark_connection
$ id       &amp;lt;int&amp;gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,...
$ Survived &amp;lt;int&amp;gt; 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1...
$ Pclass   &amp;lt;int&amp;gt; 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3...
$ Sex      &amp;lt;chr&amp;gt; &amp;quot;male&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;male&amp;quot;, &amp;quot;male&amp;quot;, &amp;quot;mal...
$ Age      &amp;lt;dbl&amp;gt; 22, 38, 26, 35, 35, NaN, 54, 2, 27, 14, 4, 58, 20, 39, 14,...
$ SibSp    &amp;lt;int&amp;gt; 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0...
$ Parch    &amp;lt;int&amp;gt; 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0...
$ Fare     &amp;lt;dbl&amp;gt; 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,...
$ Embarked &amp;lt;chr&amp;gt; &amp;quot;S&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;Q&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Perhaps the first thing to do in exploratory analysis is to check for missing values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;newdata %&amp;gt;% mutate_all(is.na) %&amp;gt;% mutate_all(as.numeric) %&amp;gt;% summarise_all(sum)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Source: spark&amp;lt;?&amp;gt; [?? x 9]
     id Survived Pclass   Sex   Age SibSp Parch  Fare Embarked
  &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
1     0        0      0     0   177     0     0     0        2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since we have a large number of missing values, it is better to impute them rather than remove them. For the numeric variable &lt;strong&gt;Age&lt;/strong&gt; we replace them with the median using the sparklyr function &lt;strong&gt;ft_imputer&lt;/strong&gt;, and for the categorical variable &lt;strong&gt;Embarked&lt;/strong&gt; we use the most frequent level, which here is the &lt;strong&gt;S&lt;/strong&gt; port. But first we should split the data into training and testing sets, to make sure that the testing set is completely isolated from the training set, and then impute each set separately.&lt;/p&gt;
&lt;p&gt;Since the data are a little bit imbalanced, we randomly split the data separately with respect to the target variable &lt;strong&gt;Survived&lt;/strong&gt; in order to preserve the same proportions of the Survived variable’s levels as in the original data; then we rebind the corresponding sets.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data_surv &amp;lt;- newdata %&amp;gt;% filter(Survived == 1)
data_not &amp;lt;- newdata %&amp;gt;% filter(Survived == 0)
partition_surv &amp;lt;- data_surv %&amp;gt;% sdf_random_split(training = 0.8, test = 0.2, seed = 123)
partition_not &amp;lt;- data_not %&amp;gt;% sdf_random_split(training = 0.8, test = 0.2, seed = 123)
train &amp;lt;- sdf_bind_rows(partition_surv$training, partition_not$training) %&amp;gt;% ft_imputer(input_cols = &amp;quot;Age&amp;quot;, 
    output_cols = &amp;quot;Age&amp;quot;, strategy = &amp;quot;median&amp;quot;) %&amp;gt;% na.replace(Embarked = &amp;quot;S&amp;quot;) %&amp;gt;% 
    compute(&amp;quot;train&amp;quot;)
test &amp;lt;- sdf_bind_rows(partition_surv$test, partition_not$test) %&amp;gt;% ft_imputer(input_cols = &amp;quot;Age&amp;quot;, 
    output_cols = &amp;quot;Age&amp;quot;, strategy = &amp;quot;median&amp;quot;) %&amp;gt;% na.replace(Embarked = &amp;quot;S&amp;quot;) %&amp;gt;% 
    compute(&amp;quot;test&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that we use the &lt;strong&gt;compute&lt;/strong&gt; function to cache the output into Spark memory.&lt;/p&gt;
&lt;p&gt;Before fitting any model, the data must be processed into a form that the model can consume. Since our model, like most machine learning models, requires numeric features, we first convert the categorical variables to integers using the function &lt;strong&gt;ft_string_indexer&lt;/strong&gt;, and then convert them to dummy variables using the function &lt;strong&gt;ft_one_hot_encoder_estimator&lt;/strong&gt;, because the latter expects its inputs to be numeric.&lt;/p&gt;
&lt;p&gt;For models built in sparklyr, the input variables should be stacked into one column vector, which can easily be done with the function &lt;strong&gt;ft_vector_assembler&lt;/strong&gt;. This step does not prevent us from applying further transformations even though the features are in one column. For instance, to run our model efficiently we can transform the variables to be on the same scale; to do so we can use either standardization (as we do here) or normalization.&lt;/p&gt;
&lt;p&gt;It is good practice to save this processed set into Spark memory under an object name using the function &lt;strong&gt;compute&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trained &amp;lt;- train %&amp;gt;% ft_string_indexer(input_col = &amp;quot;Sex&amp;quot;, output_col = &amp;quot;Sex_indexed&amp;quot;) %&amp;gt;% 
    ft_string_indexer(input_col = &amp;quot;Embarked&amp;quot;, output_col = &amp;quot;Embarked_indexed&amp;quot;) %&amp;gt;% 
    ft_one_hot_encoder_estimator(input_cols = c(&amp;quot;Pclass&amp;quot;, &amp;quot;Sex_indexed&amp;quot;, &amp;quot;Embarked_indexed&amp;quot;), 
        output_cols = c(&amp;quot;Pc_encod&amp;quot;, &amp;quot;Sex_encod&amp;quot;, &amp;quot;Emb_encod&amp;quot;)) %&amp;gt;% ft_vector_assembler(input_cols = c(&amp;quot;Pc_encod&amp;quot;, 
    &amp;quot;Sex_encod&amp;quot;, &amp;quot;Age&amp;quot;, &amp;quot;SibSp&amp;quot;, &amp;quot;Parch&amp;quot;, &amp;quot;Fare&amp;quot;, &amp;quot;Emb_encod&amp;quot;), output_col = &amp;quot;features&amp;quot;) %&amp;gt;% 
    ft_standard_scaler(input_col = &amp;quot;features&amp;quot;, output_col = &amp;quot;scaled&amp;quot;, with_mean = TRUE) %&amp;gt;% 
    select(id, Survived, scaled) %&amp;gt;% compute(&amp;quot;trained&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The same transformations above will be applied to the testing set &lt;strong&gt;test&lt;/strong&gt; as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tested &amp;lt;- test %&amp;gt;% ft_string_indexer(input_col = &amp;quot;Sex&amp;quot;, output_col = &amp;quot;Sex_indexed&amp;quot;) %&amp;gt;% 
    ft_string_indexer(input_col = &amp;quot;Embarked&amp;quot;, output_col = &amp;quot;Embarked_indexed&amp;quot;) %&amp;gt;% 
    ft_one_hot_encoder_estimator(input_cols = c(&amp;quot;Pclass&amp;quot;, &amp;quot;Sex_indexed&amp;quot;, &amp;quot;Embarked_indexed&amp;quot;), 
        output_cols = c(&amp;quot;Pc_encod&amp;quot;, &amp;quot;Sex_encod&amp;quot;, &amp;quot;Emb_encod&amp;quot;)) %&amp;gt;% ft_vector_assembler(input_cols = c(&amp;quot;Pc_encod&amp;quot;, 
    &amp;quot;Sex_encod&amp;quot;, &amp;quot;Age&amp;quot;, &amp;quot;SibSp&amp;quot;, &amp;quot;Parch&amp;quot;, &amp;quot;Fare&amp;quot;, &amp;quot;Emb_encod&amp;quot;), output_col = &amp;quot;features&amp;quot;) %&amp;gt;% 
    ft_standard_scaler(input_col = &amp;quot;features&amp;quot;, output_col = &amp;quot;scaled&amp;quot;, with_mean = TRUE) %&amp;gt;% 
    select(id, Survived, scaled) %&amp;gt;% compute(&amp;quot;tested&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we are ready to project the data onto the lower-dimensional hashes using the function &lt;strong&gt;ft_bucketed_random_projection_lsh&lt;/strong&gt;, with a bucket length of 3 and 5 hash tables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lsh_vector &amp;lt;- ft_bucketed_random_projection_lsh(sc, input_col = &amp;quot;scaled&amp;quot;, output_col = &amp;quot;hash&amp;quot;, 
    bucket_length = 3, num_hash_tables = 5, seed = 444)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To fit this model we feed the function &lt;strong&gt;ml_fit&lt;/strong&gt; with the training data &lt;strong&gt;trained&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_lsh &amp;lt;- ml_fit(lsh_vector, trained)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;prediction&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Prediction&lt;/h1&gt;
&lt;p&gt;At the prediction stage, this classification model gives us two alternatives for defining the nearest neighbours:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Define a threshold value by which we decide whether two observations are considered nearest neighbours; a small value leads to a small number of neighbours. In sparklyr we can achieve this using the function &lt;strong&gt;ml_approx_similarity_join&lt;/strong&gt;, specifying the threshold value for the maximum distance. The distance used by this function is the classical Euclidean distance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prespecify the number of nearest neighbours, regardless of the distance between observations. This second alternative can be achieved using &lt;strong&gt;ml_approx_nearest_neighbors&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each has its advantages and drawbacks, depending on the problem at hand. For instance, in medicine, if you are mainly interested in checking the similarities among patients at some level, then the first option would be your choice, but you may not be able to predict new cases that are not similar to any of the training cases within the threshold value. In contrast, if your goal is to predict all your new cases, then you would opt for the second option, but at the cost of including neighbours that are far away, forced in by the fixed number of neighbours.
To better understand what happens with each option, let’s use the following data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(plotrix))
X &amp;lt;- c(55, 31, 35, 34, 15, 28, 8, 38, 35, 19, 27, 40, 39, 19, 66, 28, 42, 21, 18, 
    14, 40, 27, 3, 19, 21, 32, 13, 18, 7, 21, 49)
Y &amp;lt;- c(16, 18, 26, 13, 8.0292, 35.5, 21.075, 31.3875, 7.225, 263, 7.8958, 27.7208, 
    146.5208, 7.75, 10.5, 82.1708, 52, 8.05, 18, 11.2417, 9.475, 21, 41.5792, 7.8792, 
    8.05, 15.5, 7.75, 17.8, 39.6875, 7.8, 76.7292)
Z &amp;lt;- factor(c(1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 
    1, 0, 0, 1, 0, 0, 0, 1))
plot(X, Y, col = Z, ylim = c(0, 55), pch = 16)
points(x = 32, y = 20, col = &amp;quot;blue&amp;quot;, pch = 8)
draw.circle(x = 32, y = 20, nv = 1000, radius = 6, lty = 2)
points(x = 55, y = 42, col = &amp;quot;green&amp;quot;, pch = 8)
draw.circle(x = 55, y = 42, nv = 1000, radius = 6, lty = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/sparklyr/lsh_spark/2020-04-28-local-snsitivity-hashing-model_files/figure-html/unnamed-chunk-10-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We use the fake data above to illustrate the difference between the two methods.
Setting the threshold at 6 for the first option, we see that the blue dot has 5 neighbours, and this dot would be predicted as black by majority vote. However, with this threshold the green dot does not have any neighbours around it and hence will be left without a prediction.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(X, Y, pch = 16, col = Z, ylim = c(0, 55))
points(x = 32, y = 20, col = &amp;quot;blue&amp;quot;, pch = 8)
points(x = 55, y = 42, col = &amp;quot;green&amp;quot;, pch = 8)
draw.circle(x = 55, y = 42, nv = 1000, radius = 21.8, lty = 2)
draw.circle(x = 32, y = 20, nv = 1000, radius = 6, lty = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/sparklyr/lsh_spark/2020-04-28-local-snsitivity-hashing-model_files/figure-html/unnamed-chunk-11-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In contrast to the above plot, using the second option the green dot can be predicted as black since, of its 5 neighbours, 3 are black; but this prediction is of doubtful quality since all the neighbours are far away from the dot of interest, and this is the major drawback of this method.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: In fact, we can overcome the drawbacks of each method by tuning the hyperparameters. With the first method we can increase the distance threshold so that all new cases will be predicted (but we may lose accuracy if we have even a single outlier). With the second we can reduce the number of nearest neighbours to get meaningful similarities (but we may lose accuracy when the dots are spread out from each other).&lt;/p&gt;
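&lt;p&gt;The effect of the threshold can also be sketched numerically (plain R, with made-up points): counting how many neighbours fall within different radii of a query point shows the trade-off between leaving a point without neighbours and admitting distant ones.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# made-up 2-d points and a query point, for illustration only
pts &amp;lt;- cbind(c(31, 35, 34, 28, 38, 27, 55), c(18, 26, 13, 35.5, 31.4, 21, 16))
query &amp;lt;- c(32, 20)
d &amp;lt;- sqrt(colSums((t(pts) - query)^2))            # euclidean distances to the query
sapply(c(3, 6, 12, 25), function(r) sum(d &amp;lt;= r))  # neighbour count per radius: 1 2 4 7&lt;/code&gt;&lt;/pre&gt;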
&lt;div id=&#34;similarity-based-on-distance&#34; class=&#34;section level2&#34; number=&#34;3.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.1&lt;/span&gt; Similarity based on distance&lt;/h2&gt;
&lt;p&gt;To show the neighbours of each point we use the function
&lt;strong&gt;ml_approx_similarity_join&lt;/strong&gt;, provided that the data has an &lt;strong&gt;id&lt;/strong&gt; column; this is the reason why we created this id earlier.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;approx_join &amp;lt;- ml_approx_similarity_join(model_lsh, trained, trained, threshold = 1, 
    dist_col = &amp;quot;dist&amp;quot;)
approx_join&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Source: spark&amp;lt;?&amp;gt; [?? x 3]
    id_a  id_b      dist
   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;     &amp;lt;dbl&amp;gt;
 1     2   376 0.813    
 2    11    11 0        
 3    16   773 0.189    
 4    16   707 0.787    
 5    20   368 0.0000809
 6    23   290 0.550    
 7    23   157 0.0787   
 8    24   873 0.707    
 9    24   448 0.502    
10    24    84 0.224    
# ... with more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This function joins the data &lt;strong&gt;trained&lt;/strong&gt; with itself to get the similar observations. The threshold determines the value below which we consider two observations similar; in other words, pairs with a dist value less than 1 are treated as similar. Let’s, for instance, pick some similar observations and check how they are similar.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train %&amp;gt;% filter(id %in% c(29, 654, 275, 199, 45))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Source: spark&amp;lt;?&amp;gt; [?? x 9]
     id Survived Pclass Sex      Age SibSp Parch  Fare Embarked
  &amp;lt;int&amp;gt;    &amp;lt;int&amp;gt;  &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;   
1    29        1      3 female    28     0     0  7.88 Q       
2    45        1      3 female    19     0     0  7.88 Q       
3   199        1      3 female    28     0     0  7.75 Q       
4   275        1      3 female    28     0     0  7.75 Q       
5   654        1      3 female    28     0     0  7.83 Q       &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, all these passengers are surviving females in the same class (third class), without children, parents or siblings, who embarked from the same port, paid approximately the same ticket price, and have the same age (except for 45, aged 19), so they are highly likely to be friends traveling together.
To predict the test set &lt;strong&gt;tested&lt;/strong&gt; we use the function &lt;strong&gt;ml_predict&lt;/strong&gt;, then we extract the similarities with the function &lt;strong&gt;ml_approx_similarity_join&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;hashed &amp;lt;- ml_predict(model_lsh, tested) %&amp;gt;% ml_approx_similarity_join(model_lsh, 
    trained, ., threshold = 1, dist_col = &amp;quot;dist&amp;quot;)
hashed&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Source: spark&amp;lt;?&amp;gt; [?? x 3]
    id_a  id_b  dist
   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt;
 1    12   863 0.904
 2    16   459 0.557
 3    29   728 0.266
 4    29    33 0.265
 5    37   245 0.479
 6    45    33 0.788
 7    48   728 0.265
 8    48   369 0.265
 9    48   187 0.848
10    54   519 0.564
# ... with more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can now choose a particular person, say id_b = 33, and then find his/her similar persons in the training set. By majority vote we decide whether that person survived or not.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;m &amp;lt;- 33
ids_train &amp;lt;- hashed %&amp;gt;% filter(id_b == m) %&amp;gt;% pull(id_a)
df1 &amp;lt;- train %&amp;gt;% filter(id %in% ids_train)
df2 &amp;lt;- test %&amp;gt;% filter(id == m)
df &amp;lt;- sdf_bind_rows(df1, df2)
df&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Source: spark&amp;lt;?&amp;gt; [?? x 9]
      id Survived Pclass Sex      Age SibSp Parch  Fare Embarked
   &amp;lt;int&amp;gt;    &amp;lt;int&amp;gt;  &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;   
 1    29        1      3 female    28     0     0  7.88 Q       
 2    45        1      3 female    19     0     0  7.88 Q       
 3    48        1      3 female    28     0     0  7.75 Q       
 4    83        1      3 female    28     0     0  7.79 Q       
 5   199        1      3 female    28     0     0  7.75 Q       
 6   275        1      3 female    28     0     0  7.75 Q       
 7   290        1      3 female    22     0     0  7.75 Q       
 8   301        1      3 female    28     0     0  7.75 Q       
 9   360        1      3 female    28     0     0  7.88 Q       
10   574        1      3 female    28     0     0  7.75 Q       
# ... with more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The last row in this table contains our test instance 33, which has 17 neighbours from the training data, a mixture of passengers who died and who survived.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df %&amp;gt;% filter(id != m) %&amp;gt;% select(Survived) %&amp;gt;% collect() %&amp;gt;% table()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## .
##  0  1 
##  5 12&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By majority vote this person will be classified as survived, since the number of non-survivors (5) is less than the number of survivors (12); hence this person is correctly classified.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;similarity-based-on-the-number-of-nearest-neighbours&#34; class=&#34;section level2&#34; number=&#34;3.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.2&lt;/span&gt; Similarity based on the number of nearest neighbours&lt;/h2&gt;
&lt;p&gt;Using the same steps as above, but here with the function &lt;strong&gt;ml_approx_nearest_neighbors&lt;/strong&gt;, we can predict any point. For example, let’s take our previous passenger 33 from the testing set. But first we have to extract the values related to this person from the transformed testing set &lt;strong&gt;tested&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;id_input &amp;lt;- tested %&amp;gt;% filter(id == m) %&amp;gt;% pull(scaled) %&amp;gt;% unlist()
id_input&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1]  0.00000000 -0.56054485 -0.50652969 -1.42132034 -0.07744921 -0.49874843
##  [7] -0.47508853 -0.54740838 -1.79973402 -0.41903250&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These are the values of the standardized vector in the column &lt;strong&gt;scaled&lt;/strong&gt; that will be used to find its closest neighbours in the training data; here we specify the number of neighbours as 7.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn &amp;lt;- ml_approx_nearest_neighbors(model_lsh, trained, key = id_input, dist_col = &amp;quot;dist&amp;quot;, 
    num_nearest_neighbors = 7)
knn&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Source: spark&amp;lt;?&amp;gt; [?? x 5]
     id Survived scaled     hash        dist
  &amp;lt;int&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;list&amp;gt;     &amp;lt;list&amp;gt;     &amp;lt;dbl&amp;gt;
1   698        1 &amp;lt;dbl [10]&amp;gt; &amp;lt;list [5]&amp;gt; 0.265
2    48        1 &amp;lt;dbl [10]&amp;gt; &amp;lt;list [5]&amp;gt; 0.265
3   275        1 &amp;lt;dbl [10]&amp;gt; &amp;lt;list [5]&amp;gt; 0.265
4   199        1 &amp;lt;dbl [10]&amp;gt; &amp;lt;list [5]&amp;gt; 0.265
5   301        1 &amp;lt;dbl [10]&amp;gt; &amp;lt;list [5]&amp;gt; 0.265
6   574        1 &amp;lt;dbl [10]&amp;gt; &amp;lt;list [5]&amp;gt; 0.265
7   265        0 &amp;lt;dbl [10]&amp;gt; &amp;lt;list [5]&amp;gt; 0.265&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These are the neighbours of our passenger, with their id’s. We can get the fraction of survivors as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;n &amp;lt;- sdf_nrow(knn)
pred &amp;lt;- knn %&amp;gt;% select(Survived) %&amp;gt;% summarise(p = sum(Survived)/n)
pred&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 1]
##       p
##   &amp;lt;dbl&amp;gt;
## 1 0.857&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since this probability is greater than 0.5, we predict that this passenger survived, and here too the passenger is correctly classified. However, in some cases we may get different predictions.&lt;/p&gt;
&lt;p&gt;To get the accuracy over the whole testing set, we use the following for loop, which requires a lot of computing time since at the end of each iteration we collect the results into R. Consequently, it will not be useful with a large data set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mypred &amp;lt;- numeric(0)
M &amp;lt;- tested %&amp;gt;% collect() %&amp;gt;% .$id
for (i in M) {
    id_input &amp;lt;- tested %&amp;gt;% filter(id == i) %&amp;gt;% pull(scaled) %&amp;gt;% unlist()
    knn &amp;lt;- ml_approx_nearest_neighbors(model_lsh, trained, key = id_input, dist_col = &amp;quot;dist&amp;quot;, 
        num_nearest_neighbors = 7)
    n &amp;lt;- sdf_nrow(knn)
    pred &amp;lt;- knn %&amp;gt;% select(Survived) %&amp;gt;% summarise(p = sum(Survived)/n) %&amp;gt;% collect()
    mypred &amp;lt;- rbind(mypred, pred)
}
mypred&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 200 x 1
       p
   &amp;lt;dbl&amp;gt;
 1 0.286
 2 1    
 3 0.571
 4 0    
 5 0.143
 6 0.857
 7 0.429
 8 1    
 9 1    
10 0.857
# ... with 190 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we first convert the probabilities into class labels, then join this data frame with the testing data, and finally use the function &lt;strong&gt;confusionMatrix&lt;/strong&gt; from the &lt;strong&gt;caret&lt;/strong&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tested_R &amp;lt;- tested %&amp;gt;% select(Survived) %&amp;gt;% collect()
new &amp;lt;- cbind(mypred, tested_R) %&amp;gt;% mutate(predicted = ifelse(p &amp;gt; 0.5, &amp;quot;1&amp;quot;, &amp;quot;0&amp;quot;))
caret::confusionMatrix(as.factor(new$Survived), as.factor(new$predicted))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 109  12
         1  30  49
                                          
               Accuracy : 0.79            
                 95% CI : (0.7269, 0.8443)
    No Information Rate : 0.695           
    P-Value [Acc &amp;gt; NIR] : 0.001704        
                                          
                  Kappa : 0.5425          
                                          
 Mcnemar&amp;#39;s Test P-Value : 0.008712        
                                          
            Sensitivity : 0.7842          
            Specificity : 0.8033          
         Pos Pred Value : 0.9008          
         Neg Pred Value : 0.6203          
             Prevalence : 0.6950          
         Detection Rate : 0.5450          
   Detection Prevalence : 0.6050          
      Balanced Accuracy : 0.7937          
                                          
       &amp;#39;Positive&amp;#39; Class : 0               
                                          &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The accuracy rate is pretty good, at 79%.&lt;/p&gt;
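&lt;p&gt;As a side note, if we only need the overall accuracy rate itself, it can also be computed directly without &lt;strong&gt;caret&lt;/strong&gt; (a quick sketch using the &lt;code&gt;new&lt;/code&gt; data frame built above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# proportion of predicted labels matching the observed labels
mean(new$predicted == as.character(new$Survived))&lt;/code&gt;&lt;/pre&gt;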
&lt;p&gt;Finally, do not forget to disconnect when your work is completed.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;spark_disconnect(sc)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Conclusion&lt;/h1&gt;
&lt;p&gt;The LSH model is an approximation of knn for large datasets. We could improve the model performance by tuning the threshold value or the number of neighbours.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;further-reading&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Further reading&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://therinspark.com&#34; class=&#34;uri&#34;&gt;https://therinspark.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing&#34; class=&#34;uri&#34;&gt;https://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;session-information&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Session information&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## R version 4.0.1 (2020-06-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] plotrix_3.7-8   forcats_0.5.0   stringr_1.4.0   dplyr_1.0.2    
##  [5] purrr_0.3.4     readr_1.3.1     tidyr_1.1.2     tibble_3.0.3   
##  [9] ggplot2_3.3.2   tidyverse_1.3.0 sparklyr_1.4.0 
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-149         fs_1.5.0             lubridate_1.7.9     
##  [4] httr_1.4.2           rprojroot_1.3-2      tools_4.0.1         
##  [7] backports_1.1.10     utf8_1.1.4           R6_2.4.1            
## [10] rpart_4.1-15         DBI_1.1.0            colorspace_1.4-1    
## [13] nnet_7.3-14          withr_2.3.0          tidyselect_1.1.0    
## [16] compiler_4.0.1       cli_2.0.2            rvest_0.3.6         
## [19] formatR_1.7          forge_0.2.0          xml2_1.3.2          
## [22] bookdown_0.20        scales_1.1.1         askpass_1.1         
## [25] digest_0.6.25        rmarkdown_2.4        base64enc_0.1-3     
## [28] pkgconfig_2.0.3      htmltools_0.5.0      dbplyr_1.4.4        
## [31] htmlwidgets_1.5.2    rlang_0.4.7          readxl_1.3.1        
## [34] rstudioapi_0.11      generics_0.0.2       jsonlite_1.7.1      
## [37] ModelMetrics_1.2.2.2 config_0.3           magrittr_1.5        
## [40] Matrix_1.2-18        Rcpp_1.0.5           munsell_0.5.0       
## [43] fansi_0.4.1          lifecycle_0.2.0      pROC_1.16.2         
## [46] stringi_1.5.3        yaml_2.2.1           MASS_7.3-53         
## [49] plyr_1.8.6           recipes_0.1.13       grid_4.0.1          
## [52] blob_1.2.1           parallel_4.0.1       crayon_1.3.4        
## [55] lattice_0.20-41      haven_2.3.1          splines_4.0.1       
## [58] hms_0.5.3            knitr_1.30           pillar_1.4.6        
## [61] uuid_0.1-4           stats4_4.0.1         reshape2_1.4.4      
## [64] codetools_0.2-16     reprex_0.3.0         glue_1.4.2          
## [67] evaluate_0.14        blogdown_0.20        data.table_1.13.0   
## [70] modelr_0.1.8         vctrs_0.3.4          foreach_1.5.0       
## [73] cellranger_1.1.0     gtable_0.3.0         openssl_1.4.3       
## [76] assertthat_0.2.1     r2d3_0.2.3           xfun_0.18           
## [79] gower_0.2.2          prodlim_2019.11.13   broom_0.7.1         
## [82] e1071_1.7-3          class_7.3-17         survival_3.2-7      
## [85] timeDate_3043.102    iterators_1.0.12     lava_1.6.8          
## [88] ellipsis_0.3.1       caret_6.0-86         ipred_0.9-9&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Bayesian linear regression</title>
      <link>https://modelingwithr.rbind.io/bayes/bayesian-linear-regression/</link>
      <pubDate>Sat, 25 Apr 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/bayes/bayesian-linear-regression/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#classical-linear-regression-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Classical linear regression model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bayesian-regression&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Bayesian regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bayesian-inferences&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Bayesian inferences&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#pd-and-p-value&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; PD and P-value&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;For statistical inference we have two general approaches or frameworks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Frequentist&lt;/strong&gt; approach, in which the data sampled from the population is considered random, while the population parameter values, stated in a null hypothesis, are fixed (but unknown). To test this null hypothesis, we look for the sample parameters that maximize the likelihood of the data. The data at hand, however, even though it was sampled randomly from the population, is now fixed, so in what sense can we treat it as random? The answer is that we assume the population distribution is known and work out the likelihood of the data under this distribution, or we imagine repeating the study many times with different samples and averaging the results. If the probability of obtaining data like ours under the null hypothesis, known as the &lt;strong&gt;p-value&lt;/strong&gt;, is very small, we tend to reject the null hypothesis.
The main problem, however, is the misunderstanding and misuse of this p-value when we decide to reject the null hypothesis based on some threshold, wrongly interpreting it as the probability that the null hypothesis is true. For more detail about the p-value, click &lt;a href=&#34;http://www.statlit.org/pdf/2016-Neath-ASA.pdf&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bayesian&lt;/strong&gt; approach, in contrast, provides true probabilities to quantify the uncertainty about a certain hypothesis, but it requires a first belief about how likely this hypothesis is, known as the &lt;strong&gt;prior&lt;/strong&gt;, in order to derive the probability of this hypothesis after seeing the data, known as the &lt;strong&gt;posterior probability&lt;/strong&gt;. This approach is called Bayesian because it is based on 
&lt;a href=&#34;https://www.probabilisticworld.com/what-is-bayes-theorem/&#34;&gt;bayes’ theorem&lt;/a&gt;: for instance, if we have a population parameter &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; to estimate, and some data &lt;span class=&#34;math inline&#34;&gt;\(D\)&lt;/span&gt; sampled randomly from this population, the posterior probability will be &lt;span class=&#34;math display&#34;&gt;\[\overbrace{p(\theta/D)}^{Posterior}=\frac{\overbrace{p(D/\theta)}^{Likelihood}.\overbrace{p(\theta)}^{Prior}}{\underbrace{p(D)}_{Evidence}}\]&lt;/span&gt;
The &lt;strong&gt;Evidence&lt;/strong&gt; is the probability of the data at hand regardless of the parameter &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
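&lt;p&gt;As a quick numeric illustration of the theorem (with made-up numbers), suppose a disease affects 1% of a population, and a test detects it with probability 0.95 but also gives a false positive with probability 0.05; the posterior probability of having the disease given a positive test is then:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prior &amp;lt;- 0.01  # p(theta): prevalence of the disease
likelihood &amp;lt;- 0.95  # p(D/theta): probability of a positive test when sick
evidence &amp;lt;- 0.95 * 0.01 + 0.05 * 0.99  # p(D): total probability of a positive test
likelihood * prior/evidence  # p(theta/D)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.1610169&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Despite the accurate test, the posterior is only about 16% because the prior is so low; this updating of a prior belief by the data is exactly what the Bayesian approach formalizes.&lt;/p&gt;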
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;For simplicity we use the &lt;strong&gt;BostonHousing&lt;/strong&gt; data from the &lt;strong&gt;mlbench&lt;/strong&gt; package; for more detail about this data, run the command &lt;code&gt;?BostonHousing&lt;/code&gt; after loading the package. But first, let’s call all the packages that we need throughout this article.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;options(warn = -1)
library(mlbench)
library(rstanarm)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Loading required package: Rcpp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## This is rstanarm version 2.21.1&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## - Default priors may change, so it&amp;#39;s safest to specify priors, even if equivalent to the defaults.&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## - For execution on a local, multicore CPU with excess RAM we recommend calling&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   options(mc.cores = parallel::detectCores())&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(bayestestR)
library(bayesplot)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## This is bayesplot version 1.7.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## - Online documentation and vignettes at mc-stan.org/bayesplot&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## - bayesplot theme set to bayesplot::theme_default()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    * Does _not_ affect other ggplot2 plots&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    * See ?bayesplot_theme_set for details on theme setting&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(insight)
library(broom)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(&amp;quot;BostonHousing&amp;quot;)
str(BostonHousing)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;#39;data.frame&amp;#39;:    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : Factor w/ 2 levels &amp;quot;0&amp;quot;,&amp;quot;1&amp;quot;: 1 1 1 1 1 1 1 1 1 1 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : num  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ b      : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To better understand how the Bayesian regression works, we keep only three features: two numeric variables, &lt;strong&gt;age&lt;/strong&gt; and &lt;strong&gt;dis&lt;/strong&gt;, and one categorical, &lt;strong&gt;chas&lt;/strong&gt;, along with the target variable &lt;strong&gt;medv&lt;/strong&gt;, the median value of owner-occupied homes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bost &amp;lt;- BostonHousing[, c(&amp;quot;medv&amp;quot;, &amp;quot;age&amp;quot;, &amp;quot;dis&amp;quot;, &amp;quot;chas&amp;quot;)]
summary(bost)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       medv            age              dis         chas   
##  Min.   : 5.00   Min.   :  2.90   Min.   : 1.130   0:471  
##  1st Qu.:17.02   1st Qu.: 45.02   1st Qu.: 2.100   1: 35  
##  Median :21.20   Median : 77.50   Median : 3.207          
##  Mean   :22.53   Mean   : 68.57   Mean   : 3.795          
##  3rd Qu.:25.00   3rd Qu.: 94.08   3rd Qu.: 5.188          
##  Max.   :50.00   Max.   :100.00   Max.   :12.127&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From the summary we do not see any particular issues, such as missing values.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;classical-linear-regression-model&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Classical linear regression model&lt;/h1&gt;
&lt;p&gt;To highlight the difference between Bayesian regression and traditional linear regression (the frequentist approach), let’s first fit the latter to our data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_freq &amp;lt;- lm(medv ~ ., data = bost)
tidy(model_freq)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 4 x 5
##   term        estimate std.error statistic  p.value
##   &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
## 1 (Intercept)   32.7      2.25      14.6   2.33e-40
## 2 age           -0.143    0.0198    -7.21  2.09e-12
## 3 dis           -0.246    0.265     -0.928 3.54e- 1
## 4 chas1          7.51     1.46       5.13  4.16e- 7&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Judging by the p.value of each regressor, all the regressors are significant except for the &lt;strong&gt;dis&lt;/strong&gt; variable. Since the variable &lt;strong&gt;chas&lt;/strong&gt; is categorical with two levels, the coefficient of &lt;strong&gt;chas1&lt;/strong&gt; is the difference between the median price of houses that bound the Charles River and that of the others, so the median price of the former is higher by about 7.513.&lt;/p&gt;
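&lt;p&gt;For later comparison with the Bayesian credible intervals, we can also extract the classical 95% confidence intervals of these coefficients with &lt;code&gt;confint&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# frequentist 95% confidence intervals for the lm coefficients
confint(model_freq)&lt;/code&gt;&lt;/pre&gt;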
&lt;/div&gt;
&lt;div id=&#34;bayesian-regression&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Bayesian regression&lt;/h1&gt;
&lt;p&gt;To fit a Bayesian regression we use the function &lt;code&gt;stan_glm&lt;/code&gt; from the &lt;a href=&#34;https://cran.r-project.org/web/packages/rstanarm/rstanarm.pdf&#34;&gt;rstanarm&lt;/a&gt; package. This function, like the &lt;strong&gt;lm&lt;/strong&gt; function above, requires providing the &lt;strong&gt;formula&lt;/strong&gt; and the data to be used; we leave all the following arguments at their default values:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;family&lt;/strong&gt; : by default this function uses the &lt;strong&gt;gaussian&lt;/strong&gt; distribution, as we do with the classical &lt;code&gt;glm&lt;/code&gt; function to fit an &lt;code&gt;lm&lt;/code&gt; model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;prior&lt;/strong&gt; : the prior distribution for the regression coefficients; by default the normal prior is used. rstanarm provides a set of functions for the prior, such as the &lt;strong&gt;student t family&lt;/strong&gt; and the &lt;strong&gt;laplace family&lt;/strong&gt;; to get the full list with all the details, run the command &lt;code&gt;?priors&lt;/code&gt;. If we want a flat uniform prior we set this to &lt;strong&gt;NULL&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;prior_intercept&lt;/strong&gt;: prior for the intercept; it can be normal, student_t, or cauchy. If we want a flat uniform prior we set this to &lt;strong&gt;NULL&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;prior_aux&lt;/strong&gt;: prior for auxiliary parameters, such as the error standard deviation for the gaussian family.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;algorithm&lt;/strong&gt;: the estimating approach to use. The default is &amp;quot;sampling&amp;quot;, i.e. MCMC&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;QR&lt;/strong&gt;: FALSE by default; if TRUE, a QR decomposition is applied to the design matrix, which is useful when we have a large number of predictors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;iter&lt;/strong&gt; : the number of iterations when the MCMC method is used; the default is 2000.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;chains&lt;/strong&gt; : the number of Markov chains; the default is 4.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;warmup&lt;/strong&gt; : also known as burn-in, the number of iterations used for adaptation, which should not be used for inference. By default it is half of the iterations.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_bayes &amp;lt;- stan_glm(medv ~ ., data = bost, seed = 111)&lt;/code&gt;&lt;/pre&gt;
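&lt;p&gt;Since rstanarm itself warns that the default priors may change, it is safer to state them explicitly. As a sketch (the model name &lt;code&gt;model_bayes2&lt;/code&gt; is ours, and the values are only similar in spirit to the current defaults), the following call spells out weakly informative priors: normal priors for the coefficients and the intercept, and an exponential prior for the error standard deviation:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# same model, but with the priors written out explicitly
model_bayes2 &amp;lt;- stan_glm(medv ~ ., data = bost, family = gaussian(),
    prior = normal(location = 0, scale = 2.5, autoscale = TRUE),
    prior_intercept = normal(location = 0, scale = 2.5, autoscale = TRUE),
    prior_aux = exponential(rate = 1, autoscale = TRUE),
    chains = 4, iter = 2000, seed = 111)&lt;/code&gt;&lt;/pre&gt;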
&lt;p&gt;If we print the model, we get the following:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;print(model_bayes, digits = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## stan_glm
##  family:       gaussian [identity]
##  formula:      medv ~ .
##  observations: 506
##  predictors:   4
## ------
##             Median MAD_SD
## (Intercept) 32.834  2.285
## age         -0.143  0.020
## dis         -0.258  0.257
## chas1        7.543  1.432
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 8.324  0.260 
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;strong&gt;Median&lt;/strong&gt; estimate is the median computed from the MCMC simulation, and &lt;strong&gt;MAD_SD&lt;/strong&gt; is the median absolute deviation computed from the same simulation. To better understand how these outputs are obtained, let’s plot the MCMC simulation of each predictor using 
&lt;a href=&#34;https://cran.r-project.org/web/packages/bayesplot/bayesplot.pdf&#34;&gt;bayesplot&lt;/a&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mcmc_dens(model_bayes, pars = c(&amp;quot;age&amp;quot;)) + vline_at(-0.143, col = &amp;quot;red&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/bayes/2020-04-25-bayesian-linear-regression_files/figure-html/unnamed-chunk-8-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As you see, the point estimate of &lt;strong&gt;age&lt;/strong&gt; falls on the median of this distribution (red line). The same is true for the &lt;strong&gt;dis&lt;/strong&gt; and &lt;strong&gt;chas&lt;/strong&gt; predictors.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mcmc_dens(model_bayes, pars = c(&amp;quot;chas1&amp;quot;)) + vline_at(7.496, col = &amp;quot;red&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/bayes/2020-04-25-bayesian-linear-regression_files/figure-html/unnamed-chunk-9-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mcmc_dens(model_bayes, pars = c(&amp;quot;dis&amp;quot;)) + vline_at(-0.244, col = &amp;quot;red&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/bayes/2020-04-25-bayesian-linear-regression_files/figure-html/unnamed-chunk-10-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now how can we evaluate the model parameters? The answer is by analyzing the posteriors using some specific statistics. To get the full statistics provided by &lt;a href=&#34;https://cran.r-project.org/web/packages/bayestestR/bayestestR.pdf&#34;&gt;bayestestR&lt;/a&gt; package, we make use of the function &lt;code&gt;describe_posterior&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;describe_posterior(model_bayes)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Possible multicollinearity between dis and age (r = 0.76). This might lead to inappropriate results. See &amp;#39;Details&amp;#39; in &amp;#39;?rope&amp;#39;.&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Description of Posterior Distributions
## 
## Parameter   | Median |           89% CI |    pd |        89% ROPE | % in ROPE |  Rhat |      ESS
## ------------------------------------------------------------------------------------------------
## (Intercept) | 32.834 | [29.218, 36.295] | 1.000 | [-0.920, 0.920] |         0 | 1.002 | 2029.279
## age         | -0.143 | [-0.175, -0.112] | 1.000 | [-0.920, 0.920] |       100 | 1.001 | 2052.155
## dis         | -0.258 | [-0.667,  0.179] | 0.819 | [-0.920, 0.920] |       100 | 1.002 | 2115.192
## chas1       |  7.543 | [ 5.159,  9.813] | 1.000 | [-0.920, 0.920] |         0 | 1.000 | 3744.403&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before starting to analyze the table, we should first understand the various statistics above, which are commonly used in Bayesian regression.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CI&lt;/strong&gt; : &lt;a href=&#34;https://freakonometrics.hypotheses.org/18117&#34;&gt;Credible Interval&lt;/a&gt;, used to quantify the uncertainty about the regression coefficients. There are two methods to compute the &lt;strong&gt;CI&lt;/strong&gt;: the 
&lt;a href=&#34;https://www.sciencedirect.com/topics/mathematics/highest-density-interval&#34;&gt;highest density interval&lt;/a&gt; &lt;code&gt;HDI&lt;/code&gt;, which is the default, and the 
&lt;a href=&#34;https://www.sciencedirect.com/topics/mathematics/credible-interval&#34;&gt;Equal-tailed Interval&lt;/a&gt; &lt;code&gt;ETI&lt;/code&gt;. With 89% probability (given the data), a coefficient lies above the &lt;strong&gt;CI_low&lt;/strong&gt; value and below the &lt;strong&gt;CI_high&lt;/strong&gt; value. This straightforward probabilistic interpretation is completely different from that of the confidence interval in classical linear regression, where the interval would contain the coefficient (if we choose a 95% confidence level) 95 times out of 100 if we repeated the study 100 times.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;pd&lt;/strong&gt; : &lt;a href=&#34;https://www.r-bloggers.com/the-p-direction-a-bayesian-equivalent-of-the-p-value/&#34;&gt;Probability of Direction&lt;/a&gt;, which is the probability that the effect goes in the positive or the negative direction; it is considered the best equivalent of the p-value.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ROPE_CI&lt;/strong&gt;: &lt;a href=&#34;https://cran.r-project.org/web/packages/bayestestR/vignettes/region_of_practical_equivalence.html&#34;&gt;Region of Practical Equivalence&lt;/a&gt;; since the Bayesian method deals with true probabilities, it does not make sense to compute the probability that the effect equals zero (the null hypothesis) exactly, because the probability of a single point in a continuous distribution is zero. Thus, we instead define a small range around zero that can be considered practically equivalent to no effect; this range is called the &lt;strong&gt;ROPE&lt;/strong&gt;. By default (following Cohen, 1988), the ROPE is [-0.1, 0.1] on the scale of the standardized coefficients.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rhat&lt;/strong&gt;: &lt;a href=&#34;https://arxiv.org/pdf/1903.08008.pdf&#34;&gt;scale reduction factor &lt;span class=&#34;math inline&#34;&gt;\(\hat R\)&lt;/span&gt;&lt;/a&gt;, it is computed for each scalar quantity of interest, as the standard deviation of that quantity from all the chains included together, divided by the root mean square of the separate within-chain standard deviations. When this value is close to 1 we do not have any convergence problem with MCMC.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ESS&lt;/strong&gt; : &lt;a href=&#34;https://arxiv.org/pdf/1903.08008.pdf&#34;&gt;effective sample size&lt;/a&gt;, it captures how many independent draws contain the same amount of information as the dependent sample obtained by the MCMC algorithm, the higher the ESS the better. The threshold used in practice is 400.&lt;/li&gt;
&lt;/ul&gt;
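&lt;p&gt;To make these statistics concrete, here is a sketch of how &lt;strong&gt;pd&lt;/strong&gt; and the proportion of draws inside a given &lt;strong&gt;ROPE&lt;/strong&gt; could be computed by hand from the posterior draws of a coefficient (using the [-0.92, 0.92] ROPE range reported in the table above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;draws &amp;lt;- as.data.frame(model_bayes)$age  # posterior draws of the age coefficient
max(mean(draws &amp;gt; 0), mean(draws &amp;lt; 0))  # pd: share of draws on the dominant side
mean(draws &amp;gt;= -0.92 &amp;amp; draws &amp;lt;= 0.92)  # share of draws inside the ROPE&lt;/code&gt;&lt;/pre&gt;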
&lt;p&gt;Alternatively, we can get the coefficient estimates (which are the medians by default) separately by using the package &lt;strong&gt;insight&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;post &amp;lt;- get_parameters(model_bayes)
print(purrr::map_dbl(post, median), digits = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## (Intercept)         age         dis       chas1 
##      32.834      -0.143      -0.258       7.543&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also compute the maximum a posteriori (MAP) estimate and the mean as follows&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;print(purrr::map_dbl(post, map_estimate), digits = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## (Intercept)         age         dis       chas1 
##      33.025      -0.145      -0.295       7.573&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;print(purrr::map_dbl(post, mean), digits = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## (Intercept)         age         dis       chas1 
##      32.761      -0.143      -0.248       7.523&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, the values are close to each other, due to the near-normality of the posterior distributions, for which all the central statistics (mean, median, mode) roughly coincide.
The following plot visualizes the age coefficient using these different statistics.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mcmc_dens(model_bayes, pars = c(&amp;quot;age&amp;quot;)) + vline_at(median(post$age), col = &amp;quot;red&amp;quot;) + 
    vline_at(mean(post$age), col = &amp;quot;yellow&amp;quot;) + vline_at(map_estimate(post$age), col = &amp;quot;green&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/bayes/2020-04-25-bayesian-linear-regression_files/figure-html/unnamed-chunk-14-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As expected they are approximately on top of each other.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;bayesian-inferences&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Bayesian inferences&lt;/h1&gt;
&lt;p&gt;As with classical (frequentist) regression, we can test the significance of the Bayesian regression coefficients by checking whether the corresponding credible interval contains zero; if it does not, the coefficient is significant. Let’s go back to our model and check the significance of each coefficient (using the credible interval based on the default &lt;code&gt;hdi&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;hdi(model_bayes)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Highest Density Interval
## 
## Parameter   |        89% HDI
## ----------------------------
## (Intercept) | [29.22, 36.29]
## age         | [-0.18, -0.11]
## dis         | [-0.67,  0.18]
## chas1       | [ 5.16,  9.81]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And based on the &lt;code&gt;eti&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;eti(model_bayes)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Equal-Tailed Interval
## 
## Parameter   |        89% ETI
## ----------------------------
## (Intercept) | [29.20, 36.28]
## age         | [-0.17, -0.11]
## dis         | [-0.67,  0.18]
## chas1       | [ 5.17,  9.83]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using both methods, the only non-significant coefficient is that of the &lt;strong&gt;dis&lt;/strong&gt; variable, which is in line with the classical regression.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: this agreement between the frequentist and Bayesian regressions may be due to the normality assumption of the former being well satisfied here, and to the normal prior used in the latter. In the real world, however, we are less often sure about the normality assumption, which may lead to contradictory conclusions between the two approaches.&lt;/p&gt;
&lt;p&gt;Another way to test significance is to check the part of the credible interval that falls inside the ROPE interval. We can get this by calling the &lt;code&gt;rope&lt;/code&gt; function from the &lt;strong&gt;bayestestR&lt;/strong&gt; package&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rope(post$age)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Proportion of samples inside the ROPE [-0.10, 0.10]:
## 
## inside ROPE
## -----------
## 0.00 %&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For age, almost all of the credible interval (HDI) is outside the ROPE range, which means that this coefficient is highly significant.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rope(post$chas1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Proportion of samples inside the ROPE [-0.10, 0.10]:
## 
## inside ROPE
## -----------
## 0.00 %&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rope(post$`(Intercept)`)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Proportion of samples inside the ROPE [-0.10, 0.10]:
## 
## inside ROPE
## -----------
## 0.00 %&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The same is true for the &lt;strong&gt;chas&lt;/strong&gt; variable and the &lt;strong&gt;intercept&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rope(post$dis)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Proportion of samples inside the ROPE [-0.10, 0.10]:
## 
## inside ROPE
## -----------
## 20.02 %&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In contrast, about a fifth of the credible interval of the &lt;strong&gt;dis&lt;/strong&gt; variable is inside the ROPE interval; in other words, the probability that this coefficient is practically zero is 20.02%. Note that &lt;code&gt;rope&lt;/code&gt; applied to raw draws uses the default range [-0.10, 0.10], whereas the model-based ROPE range used by &lt;code&gt;describe_posterior&lt;/code&gt; can be obtained with &lt;code&gt;rope_range&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rope_range(model_bayes)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] -0.9197104  0.9197104&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;pd-and-p-value&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; PD and P-value&lt;/h1&gt;
&lt;p&gt;Sometimes we are only interested in checking the direction of the coefficient (positive or negative). This is the role of the &lt;strong&gt;pd&lt;/strong&gt; statistic in the table above: a high value means that the associated effect is concentrated on the same side as the median. For our model, since the pd’s are equal to 1, almost all the posterior draws of the two variables &lt;strong&gt;age&lt;/strong&gt; and &lt;strong&gt;chas1&lt;/strong&gt; and of the intercept are on the same side (if the median is negative, almost all the other values are negative). It should be noted, however, that this statistic does not assess the significance of the effect.
More importantly, there exists a strong relation between this probability and the p-value, approximated as follows: the one-sided p-value &lt;span class=&#34;math inline&#34;&gt;\(\approx 1-pd\)&lt;/span&gt;, and the usual two-sided p-value &lt;span class=&#34;math inline&#34;&gt;\(\approx 2\times(1-pd)\)&lt;/span&gt;. Let’s check this with our variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df1 &amp;lt;- dplyr::select(tidy(model_freq), c(term, p.value))
df1$p.value &amp;lt;- round(df1$p.value, digits = 3)
df2 &amp;lt;- 1 - purrr::map_dbl(post, p_direction)
df &amp;lt;- cbind(df1, df2)
df&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                    term p.value     df2
## (Intercept) (Intercept)   0.000 0.00000
## age                 age   0.000 0.00000
## dis                 dis   0.354 0.18075
## chas1             chas1   0.000 0.00000&lt;/code&gt;&lt;/pre&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Over the last decade more practitioners, especially in fields such as medicine and psychology, have been turning towards Bayesian analysis, since almost everything can be interpreted straightforwardly in a probabilistic manner. However, Bayesian analysis also has some drawbacks, such as the subjective way of defining the priors (which play an important role in computing the posterior); moreover, for problems that do not have a conjugate prior, the MCMC algorithm does not always converge easily to the right values (especially with complex data).&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, 2012, page 589&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Predicting images using Convolutional neural network</title>
      <link>https://modelingwithr.rbind.io/courses/cnn_imag/cnn_imag/</link>
      <pubDate>Sat, 25 Apr 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/courses/cnn_imag/cnn_imag/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#training-the-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Training the model:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-evaluation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Model Evaluation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#prediction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Prediction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;In this article we will make use of the convolutional neural network, the most widely used deep learning method for image classification, object detection, etc.&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; For more detail about how it works, please click &lt;a href=&#34;https://docs.google.com/presentation/d/1f7yAMxElPorSAdy3iiIBWw6Py20uiu2_xK4lmeB9Dpk/edit?usp=sharing&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We are going to learn how to build and train a &lt;strong&gt;convolutional neural network&lt;/strong&gt; model using a small sample of images collected from Google search. The data includes 30 images, each showing one of three types of animals: &lt;strong&gt;cat&lt;/strong&gt;, &lt;strong&gt;dog&lt;/strong&gt;, or &lt;strong&gt;lion&lt;/strong&gt;, with an equal number of images per class, that is 10.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;First, we load the packages needed throughout this article and read the data into two objects: one called &lt;strong&gt;train&lt;/strong&gt;, which contains 7 instances of each animal type and is used for training the model, and another called &lt;strong&gt;test&lt;/strong&gt;, which contains the remaining instances for evaluating model performance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh &amp;lt;- suppressPackageStartupMessages
ssh(library(EBImage))
ssh(library(keras))
ssh(library(foreach))

mytrain &amp;lt;- c(paste0(&amp;quot;../images/cat&amp;quot;,1:7,&amp;quot;.jpg&amp;quot;),paste0(&amp;quot;../images/dog&amp;quot;,1:7,&amp;quot;.jpg&amp;quot;),
        paste0(&amp;quot;../images/lion&amp;quot;,1:7,&amp;quot;.jpg&amp;quot;))

mytest &amp;lt;- c(paste0(&amp;quot;../images/cat&amp;quot;,8:10,&amp;quot;.jpg&amp;quot;),paste0(&amp;quot;../images/dog&amp;quot;,8:10,&amp;quot;.jpg&amp;quot;),
        paste0(&amp;quot;../images/lion&amp;quot;,8:10,&amp;quot;.jpg&amp;quot;))

train &amp;lt;- lapply(mytrain, readImage)
test &amp;lt;- lapply(mytest, readImage)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let us figure out what information each image contains.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train[[1]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Image 
##   colorMode    : Color 
##   storage.mode : double 
##   dim          : 275 183 3 
##   frames.total : 3 
##   frames.render: 1 
## 
## imageData(object)[1:5,1:6,1]
##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
## [1,] 0.2039216 0.2039216 0.2039216 0.2078431 0.2000000 0.2000000
## [2,] 0.2039216 0.2039216 0.2078431 0.2078431 0.2000000 0.2039216
## [3,] 0.2078431 0.2078431 0.2078431 0.2117647 0.2039216 0.2078431
## [4,] 0.2117647 0.2117647 0.2156863 0.2156863 0.2117647 0.2117647
## [5,] 0.2156863 0.2156863 0.2196078 0.2196078 0.2156863 0.2156863&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, this is a color image of 275 by 183 pixels with 3 channels (RGB).&lt;/p&gt;
&lt;p&gt;We can visualize an image as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(test[[4]])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/cnn_imag/2020-04-25-cnn-imag_files/figure-html/unnamed-chunk-4-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If instead we want to visualize all the images as one block, we can make use of the &lt;strong&gt;foreach&lt;/strong&gt; package to apply a for loop as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mfrow=c(7,3))
foreach(i=1:21) %do% {plot(train[[i]])}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/cnn_imag/2020-04-25-cnn-imag_files/figure-html/unnamed-chunk-5-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mfrow=c(1,1))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Having taken a brief glance at our data, we see that the image sizes differ from one another, which is not what our image classification model expects. The following script therefore resizes all the images to the same size, &lt;strong&gt;150x150x3&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;foreach(i=1:21) %do% {train[[i]] &amp;lt;- resize(train[[i]],150,150)}
foreach(i=1:9) %do% {test[[i]] &amp;lt;- resize(test[[i]],150,150)}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To check the result we use the following:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;str(test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## List of 9
##  $ :Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   .. ..@ .Data    : num [1:150, 1:150, 1:3] 0.761 0.78 0.773 0.755 0.768 ...
##   .. ..@ colormode: int 2
##  $ :Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   .. ..@ .Data    : num [1:150, 1:150, 1:3] 0.441 0.465 0.496 0.528 0.524 ...
##   .. ..@ colormode: int 2
##  $ :Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   .. ..@ .Data    : num [1:150, 1:150, 1:3] 0.986 0.992 0.951 0.945 0.929 ...
##   .. ..@ colormode: int 2
##  $ :Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   .. ..@ .Data    : num [1:150, 1:150, 1:3] 0.81 0.751 0.787 0.825 0.508 ...
##   .. ..@ colormode: int 2
##  $ :Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   .. ..@ .Data    : num [1:150, 1:150, 1:3] 0.357 0.5 0.49 0.626 0.522 ...
##   .. ..@ colormode: int 2
##  $ :Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   .. ..@ .Data    : num [1:150, 1:150, 1:3] 0.375 0.365 0.375 0.397 0.393 ...
##   .. ..@ colormode: int 2
##  $ :Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   .. ..@ .Data    : num [1:150, 1:150, 1:3] 0.651 0.57 0.614 0.634 0.63 ...
##   .. ..@ colormode: int 2
##  $ :Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   .. ..@ .Data    : num [1:150, 1:150, 1:3] 0.268 0.201 0.198 0.213 0.182 ...
##   .. ..@ colormode: int 2
##  $ :Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   .. ..@ .Data    : num [1:150, 1:150, 1:3] 0 0 0 0 0 ...
##   .. ..@ colormode: int 2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, all the images now have the same size, each stored as a 3-dimensional array. The next step is to combine all the images into one block.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trainall &amp;lt;- combine(train)
testall &amp;lt;- combine(test) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can display the output block using the following:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;display(tile(trainall,7))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/cnn_imag/2020-04-25-cnn-imag_files/figure-html/unnamed-chunk-9-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now the images are nicely combined into one block with four dimensions: the number of instances (images), height, width, and the number of channels; this block is the input that will be used in our model. However, to read the input correctly, the model expects the first dimension to be the number of instances, the second the height, the third the width, and the fourth the number of channels.
Let us check whether the input has the correct order.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;str(trainall)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   ..@ .Data    : num [1:150, 1:150, 1:3, 1:21] 0.204 0.209 0.216 0.223 0.233 ...
##   ..@ colormode: int 2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This order is not correct since the number of instances is in the last position, so we reorder the positions as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trainall &amp;lt;- aperm(trainall, c(4,1,2,3))
testall &amp;lt;- aperm(testall, c(4,1,2,3))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The last thing to do, before defining the architecture of our model, is to create a variable holding the image labels and then convert it to dummy (one-hot) variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trainlabels &amp;lt;- rep(0:2, each=7)
testlabels &amp;lt;- rep(0:2, each=3)
trainy &amp;lt;- to_categorical(trainlabels)
testy &amp;lt;- to_categorical(testlabels)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;training-the-model&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Training the model:&lt;/h1&gt;
&lt;p&gt;The architecture of our model will contain the following layers:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;A convolution layer with 32 filters of size 3x3 (since the input is 150x150x3, the third dimension of each filter will be 3, i.e. 3x3x3), with &lt;strong&gt;ReLU&lt;/strong&gt; as the activation function.&lt;/li&gt;
&lt;li&gt;A max pooling layer of 3x3 with strides=2.&lt;/li&gt;
&lt;li&gt;A convolution layer with 64 filters of size 5x5, with the &lt;strong&gt;ReLU&lt;/strong&gt; function.&lt;/li&gt;
&lt;li&gt;A max pooling layer of 2x2 with strides=2.&lt;/li&gt;
&lt;li&gt;A convolution layer with 128 filters of size 3x3, with the &lt;strong&gt;ReLU&lt;/strong&gt; function.&lt;/li&gt;
&lt;li&gt;A max pooling layer of 2x2 with strides=2.&lt;/li&gt;
&lt;li&gt;A flatten layer to collapse all the output elements into one long vector, so that it can be connected to a traditional neural network with fully connected layers.&lt;/li&gt;
&lt;li&gt;A dense layer of 256 nodes with the &lt;strong&gt;leaky ReLU&lt;/strong&gt; function; the slope for the negative part will be &lt;strong&gt;0.1&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;A dropout layer with a rate of 40%, which acts as a regularization method by randomly ignoring 40% of the nodes in each epoch (iteration).&lt;/li&gt;
&lt;li&gt;The output layer with 3 nodes, since we have 3 classes, with the &lt;strong&gt;softmax&lt;/strong&gt; function.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In &lt;strong&gt;keras&lt;/strong&gt; package the above steps will be coded as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model &amp;lt;- keras_model_sequential()

model %&amp;gt;% 
  layer_conv_2d(filters = 32,
                        kernel_size = c(3,3),
                        activation = &amp;quot;relu&amp;quot;,
                        input_shape = c(150,150,3))%&amp;gt;%
  layer_max_pooling_2d(pool_size = c(3,3), strides = 2)%&amp;gt;%
  layer_conv_2d(filters = 64,
               kernel_size = c(5,5),
                activation = &amp;quot;relu&amp;quot;) %&amp;gt;%
  layer_max_pooling_2d(pool_size = c(2,2), strides = 2)%&amp;gt;%
  layer_conv_2d(filters = 128,
                kernel_size = c(3,3),
                activation = &amp;quot;relu&amp;quot;) %&amp;gt;%
  layer_max_pooling_2d(pool_size = c(2,2), strides = 2)%&amp;gt;%
  layer_flatten()%&amp;gt;%
  layer_dense(units=256)%&amp;gt;% layer_activation_leaky_relu(alpha = 0.1)%&amp;gt;%
  layer_dropout(rate=0.4)%&amp;gt;%
  layer_dense(units=3, activation = &amp;quot;softmax&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can inspect this architecture and the number of parameters it has by calling the summary function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(model)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Model: &amp;quot;sequential&amp;quot;
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## conv2d (Conv2D)                     (None, 148, 148, 32)            896         
## ________________________________________________________________________________
## max_pooling2d (MaxPooling2D)        (None, 73, 73, 32)              0           
## ________________________________________________________________________________
## conv2d_1 (Conv2D)                   (None, 69, 69, 64)              51264       
## ________________________________________________________________________________
## max_pooling2d_1 (MaxPooling2D)      (None, 34, 34, 64)              0           
## ________________________________________________________________________________
## conv2d_2 (Conv2D)                   (None, 32, 32, 128)             73856       
## ________________________________________________________________________________
## max_pooling2d_2 (MaxPooling2D)      (None, 16, 16, 128)             0           
## ________________________________________________________________________________
## flatten (Flatten)                   (None, 32768)                   0           
## ________________________________________________________________________________
## dense (Dense)                       (None, 256)                     8388864     
## ________________________________________________________________________________
## leaky_re_lu (LeakyReLU)             (None, 256)                     0           
## ________________________________________________________________________________
## dropout (Dropout)                   (None, 256)                     0           
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 3)                       771         
## ================================================================================
## Total params: 8,515,651
## Trainable params: 8,515,651
## Non-trainable params: 0
## ________________________________________________________________________________&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, the model has a huge number of parameters: &lt;strong&gt;8,515,651&lt;/strong&gt;. Since the data has only 21 instances, training on my laptop takes only a few seconds; with a large dataset, however, this model may take much longer.&lt;/p&gt;
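The parameter counts in the summary can be verified by hand: each convolution layer has (filter height x filter width x input channels) weights per filter plus one bias per filter, and each dense layer has (inputs x units) weights plus one bias per unit:

```r
# Recompute the parameter counts reported by summary(model).
conv1  <- (3 * 3 * 3)     * 32  + 32   # 896
conv2  <- (5 * 5 * 32)    * 64  + 64   # 51264
conv3  <- (3 * 3 * 64)    * 128 + 128  # 73856
dense1 <- (16 * 16 * 128) * 256 + 256  # 8388864 (flattened 16x16x128 = 32768 inputs)
dense2 <- 256 * 3 + 3                  # 771
conv1 + conv2 + conv3 + dense1 + dense2  # 8515651, matching the summary
```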
&lt;p&gt;The last step before running the model is to specify the loss function, the optimizer and the metric.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For multiclassification problem the most widely used one is &lt;a href=&#34;https://towardsdatascience.com/common-loss-functions-in-machine-learning-46af0ffc4d23&#34;&gt;categorical cross entropy&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Besides the popular &lt;strong&gt;gradient descent&lt;/strong&gt; &lt;a href=&#34;https://keras.io/optimizers/&#34;&gt;optimizer&lt;/a&gt; (with its variants, &lt;strong&gt;stochastic gradient descent&lt;/strong&gt; and &lt;strong&gt;mini-batch gradient descent&lt;/strong&gt;), there exist others such as &lt;strong&gt;adam&lt;/strong&gt;, &lt;strong&gt;adadelta&lt;/strong&gt; and &lt;strong&gt;rmsprop&lt;/strong&gt; (the first one will be used in our case). In practice we sometimes tune the hyperparameters by switching between these optimizers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For classification problems we have many &lt;a href=&#34;https://keras.io/metrics/&#34;&gt;metrics&lt;/a&gt;, the famous ones are: &lt;strong&gt;accuracy&lt;/strong&gt; (used for our case), &lt;strong&gt;roc&lt;/strong&gt;, &lt;strong&gt;area under roc&lt;/strong&gt;, &lt;strong&gt;precision&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model %&amp;gt;% compile(loss= &amp;quot;categorical_crossentropy&amp;quot;,
                  optimizer=&amp;quot;adam&amp;quot;,
                  metrics=&amp;quot;accuracy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At this stage everything is ready to train our model by calling the &lt;strong&gt;fit&lt;/strong&gt; function. The epoch value is the number of iterations, or gradient descent passes over the data, and &lt;strong&gt;validation_split&lt;/strong&gt; is the fraction of samples held out for assessment, here four images. I have run this model before, and in order to avoid running it again I have commented the script out with #; if you want to run it, just uncomment it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#history &amp;lt;- model %&amp;gt;%
  #fit(trainall, trainy, epoch=50, validation_split=0.2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unlike classical machine learning models, where we can set a seed to make the results reproducible, here each time we rerun the model we get a different result. In practice, we intentionally rerun the model many times to improve its performance, and once we get the best one we save it as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#save_model_hdf5(model, &amp;quot;modelcnn.h5&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And we can load it again as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model &amp;lt;- load_model_hdf5(&amp;quot;modelcnn.h5&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The history object holds all the necessary information, such as the metric values for each epoch, so we can extract this information to create a plot as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#train_loss &amp;lt;- history$metrics$loss
#valid_loss &amp;lt;- history$metrics$val_loss
#train_acc &amp;lt;- history$metrics$accuracy
#valid_acc &amp;lt;- history$metrics$val_accuracy
#epoch &amp;lt;- 1:50&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#df &amp;lt;- tibble::tibble(epoch,train_loss,valid_loss,train_acc,valid_acc)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;ggplot2&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#p1 &amp;lt;- ggplot(df,aes(x=epoch, train_loss))+
 # geom_point(size=1, color=&amp;quot;blue&amp;quot;)+
#  geom_point(aes(x=epoch, valid_loss), size=1, color=&amp;quot;red&amp;quot;)+
 # ylab(&amp;quot;Loss&amp;quot;)
#ggsave(&amp;quot;plot_loss.jpg&amp;quot;, p1, device = &amp;quot;jpeg&amp;quot;, width = 20, height = 15, units = &amp;quot;cm&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you may notice, the code above has not been executed, to avoid the reproducibility issue discussed earlier and to keep things simple. Here we load the saved plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mar=c(0,0,0,0))
plot(as.raster(readImage(&amp;quot;plot_loss.jpg&amp;quot;)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/cnn_imag/2020-04-25-cnn-imag_files/figure-html/unnamed-chunk-22-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This plot shows the loss values for both the training set (in blue) and the validation set (in red). We see that the training loss consistently decreases, whereas the validation loss oscillates widely, reflecting the model’s limited ability to predict new, unseen examples.
The same conclusion can be drawn from the following plot of the accuracy metric.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#p2 &amp;lt;- ggplot(df,aes(x=epoch, train_acc))+
  #geom_point(size=1, color=&amp;quot;blue&amp;quot;)+
  #geom_point(aes(x=epoch, valid_acc), size=1, color=&amp;quot;red&amp;quot;)+
 # ylab(&amp;quot;accuracy&amp;quot;)
#ggsave(&amp;quot;plot_acc.jpg&amp;quot;, p2, device = &amp;quot;jpeg&amp;quot;, width = 20, height = 15, units = &amp;quot;cm&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here again we do the same thing:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mar=c(0,0,0,0))
plot(as.raster(readImage(&amp;quot;plot_acc.jpg&amp;quot;)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/cnn_imag/2020-04-25-cnn-imag_files/figure-html/unnamed-chunk-24-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: we could have used the plot function directly, plot(history), but doing so we would get a different plot each time we knit the document.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;model-evaluation&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Model Evaluation&lt;/h1&gt;
&lt;p&gt;We can evaluate the model performance using the training set as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_evaluate&amp;lt;- evaluate(model, trainall, trainy)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this first architecture we get a high accuracy rate of &lt;strong&gt;95.24%&lt;/strong&gt; and a loss of &lt;strong&gt;0.0832&lt;/strong&gt;. However, you should be cautious when this rate is very high, since it is computed from the training data and in many cases reflects the &lt;strong&gt;overfitting&lt;/strong&gt; problem&lt;a href=&#34;#fn2&#34; class=&#34;footnote-ref&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;. The better evaluation is therefore the one based on the testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test_evaluate&amp;lt;- evaluate(model, testall, testy)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using the testing set, which the model has not seen, the accuracy rate drops to about 55.56%.
This is exactly what we warned about: we have an overfitting problem, where the model tries to memorize every noisy pattern, which constrains it to generalize poorly to unseen data.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;prediction&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Prediction&lt;/h1&gt;
&lt;p&gt;We can get the predictions of the testing set as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict_classes(model,testall)
pred&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0 0 2 0 2 0 2 2 2&lt;/code&gt;&lt;/pre&gt;
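These predicted classes can be cross-tabulated against the true labels to see exactly where the errors occur; since the predictions are printed above, we can reproduce the test accuracy directly:

```r
# Cross-tabulate the predicted classes printed above against the true labels.
pred <- c(0, 0, 2, 0, 2, 0, 2, 2, 2)  # predict_classes output shown above
testlabels <- rep(0:2, each = 3)      # true classes: cat = 0, dog = 1, lion = 2
table(predicted = pred, actual = testlabels)
mean(pred == testlabels)              # 5/9, i.e. the 55.56% reported earlier
```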
&lt;p&gt;The following plot shows which images from the testing set are correctly classified and which are not:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred[pred==0] &amp;lt;- &amp;quot;cat&amp;quot;
pred[pred==1] &amp;lt;- &amp;quot;dog&amp;quot;
pred[pred==2] &amp;lt;- &amp;quot;lion&amp;quot;


par(mfrow=c(3,3))


foreach(i=1:9) %do% {display(test[[i]], method=&amp;quot;raster&amp;quot;);
  text(x = 20, y = 20, label = pred[i], 
       adj = c(0,1), col = &amp;quot;black&amp;quot;, cex = 4)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/cnn_imag/2020-04-25-cnn-imag_files/figure-html/unnamed-chunk-28-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mfrow=c(1,1))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using this model to predict the test examples, all the dogs are misclassified whereas the lions are perfectly classified.&lt;/p&gt;
&lt;p&gt;We can also display the training examples as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred1 &amp;lt;- predict_classes(model,trainall)

pred1[pred1==0] &amp;lt;- &amp;quot;cat&amp;quot;
pred1[pred1==1] &amp;lt;- &amp;quot;dog&amp;quot;
pred1[pred1==2] &amp;lt;- &amp;quot;lion&amp;quot;


par(mfrow=c(7,3))


foreach(i=1:21) %do% {display(train[[i]], method=&amp;quot;raster&amp;quot;);
  text(x = 20, y = 20, label = pred1[i], 
       adj = c(0,1), col = &amp;quot;black&amp;quot;, cex = 2)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/cnn_imag/2020-04-25-cnn-imag_files/figure-html/unnamed-chunk-29-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mfrow=c(1,1))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Conclusion&lt;/h1&gt;
&lt;p&gt;As we have seen, this model perfectly identified the lions but failed to identify any of the dogs in the testing set, which is not the case for the training data, where the model achieves high accuracy; as mentioned earlier, this is a consequence of the overfitting problem. However, there are a bunch of techniques that can be used in such situations, such as regularization methods (L2, L1), pooling, dropout layers, etc. All these techniques will be addressed in further articles.
Besides tackling overfitting, we can also improve the model by playing around with hyperparameters, such as the number of layers, the number of nodes in each layer, the number of epochs, etc.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Be aware that this model cannot be considered reliable, since it was trained on a very small dataset. However, we may get higher performance from this model if we train it on a very large dataset.&lt;/p&gt;
&lt;h1&gt;Session information&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## R version 4.0.1 (2020-06-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.3.2  foreach_1.5.0  keras_2.3.0.0  EBImage_4.30.0
## 
## loaded via a namespace (and not attached):
##  [1] reticulate_1.16     locfit_1.5-9.4      tidyselect_1.1.0   
##  [4] xfun_0.18           purrr_0.3.4         lattice_0.20-41    
##  [7] colorspace_1.4-1    generics_0.0.2      vctrs_0.3.4        
## [10] htmltools_0.5.0     yaml_2.2.1          base64enc_0.1-3    
## [13] rlang_0.4.7         pillar_1.4.6        withr_2.3.0        
## [16] glue_1.4.2          rappdirs_0.3.1      BiocGenerics_0.34.0
## [19] jpeg_0.1-8.1        lifecycle_0.2.0     tensorflow_2.2.0   
## [22] stringr_1.4.0       munsell_0.5.0       blogdown_0.20      
## [25] gtable_0.3.0        htmlwidgets_1.5.2   codetools_0.2-16   
## [28] evaluate_0.14       knitr_1.30          tfruns_1.4         
## [31] parallel_4.0.1      Rcpp_1.0.5          scales_1.1.1       
## [34] jsonlite_1.7.1      abind_1.4-5         png_0.1-7          
## [37] digest_0.6.25       stringi_1.5.3       tiff_0.1-5         
## [40] bookdown_0.20       dplyr_1.0.2         grid_4.0.1         
## [43] tools_4.0.1         bitops_1.0-6        magrittr_1.5       
## [46] RCurl_1.98-1.2      tibble_3.0.3        crayon_1.3.4       
## [49] whisker_0.4         pkgconfig_2.0.3     zeallot_0.1.0      
## [52] ellipsis_0.3.1      Matrix_1.2-18       rmarkdown_2.4      
## [55] iterators_1.0.12    R6_2.4.1            fftwtools_0.9-9    
## [58] compiler_4.0.1&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;François Chollet, Deep Learning with R, MEAP edition, 2017, p. 112 &lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;Gareth James et al., An Introduction to Statistical Learning, Springer, New York, page 33, &lt;a href=&#34;ISBN:978-1-4614-7173-0&#34; class=&#34;uri&#34;&gt;ISBN:978-1-4614-7173-0&lt;/a&gt;&lt;a href=&#34;#fnref2&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Predicting large and imbalanced data set using the R package tidymodels</title>
      <link>https://modelingwithr.rbind.io/post/scania/predicting-large-and-imbalanced-data-set-using-the-r-package-tidymodels/</link>
      <pubDate>Tue, 14 Apr 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/post/scania/predicting-large-and-imbalanced-data-set-using-the-r-package-tidymodels/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-exploration&#34;&gt;Data exploration&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#summary-of-the-variables&#34;&gt;Summary of the variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#missing-values&#34;&gt;Missing values&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#imbalanced-data&#34;&gt;imbalanced data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#building-the-recipe&#34;&gt;Building the recipe&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#building-the-workflow&#34;&gt;Building the workflow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#random-forest-model&#34;&gt;Random forest model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#model-training&#34;&gt;Model training&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-evaluation&#34;&gt;Model evaluation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-tuning&#34;&gt;Model tuning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#logistic-regression-model&#34;&gt;Logistic regression model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-information&#34;&gt;Session information&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;div id=&#34;introduction&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;The easiest way, at least for me, to deploy machine learning models is to make use of the R package &lt;strong&gt;tidymodels&lt;/strong&gt;, a collection of packages that makes the workflow steps of a project smooth, tightly connected to each other, and easily manageable in a well-structured manner.
The core packages contained in tidymodels are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;rsample: for data splitting and resampling.&lt;/li&gt;
&lt;li&gt;parsnip: unified interface to the most common machine learning models.&lt;/li&gt;
&lt;li&gt;recipes: unified interface to the most common pre-processing tools for feature engineering.&lt;/li&gt;
&lt;li&gt;workflows: bundles the workflow steps together.&lt;/li&gt;
&lt;li&gt;tune: for optimization of hyperparameters.&lt;/li&gt;
&lt;li&gt;yardstick: provides the most common performance metrics.&lt;/li&gt;
&lt;li&gt;broom: converts outputs into user-friendly formats such as tibbles.&lt;/li&gt;
&lt;li&gt;dials: provides tools for parameter grids.&lt;/li&gt;
&lt;li&gt;infer: provides tools for statistical inference.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition to the above packages, tidymodels also includes some classical packages such as dplyr, ggplot2, purrr, and tibble. For more detail click &lt;a href=&#34;https://www.tidymodels.org&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In order to explore and understand tidymodels more widely, we should look for a noisy dataset that has a large number of variables with missing values. Fortunately, I found an open source dataset that fulfils these requirements and, in addition, is highly imbalanced. This data is about &lt;strong&gt;Scania trucks&lt;/strong&gt; and can be downloaded from the &lt;a href=&#34;https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks&#34;&gt;UCI machine learning repository&lt;/a&gt;, with an extra file for its description.&lt;/p&gt;
&lt;p&gt;The target variable of this data concerns the air pressure system &lt;strong&gt;APS&lt;/strong&gt; in the truck, which generates the pressurized air that is utilized in various functions of the truck. It has two classes: positive &lt;strong&gt;pos&lt;/strong&gt; if a component failure is due to a failure in the APS system, and negative &lt;strong&gt;neg&lt;/strong&gt; if a component failure is not related to the APS system. This means that we are dealing with a binary classification problem.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-exploration&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data exploration&lt;/h2&gt;
&lt;p&gt;The data is already separated into training and testing sets at the source, so let’s load the packages that we need along with the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh &amp;lt;- suppressPackageStartupMessages
ssh(library(readr))
ssh(library(caret))
ssh(library(themis))
ssh(library(tidymodels))
train &amp;lt;- read_csv(&amp;quot;https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_training_set.csv&amp;quot;, skip = 20)
test &amp;lt;- read_csv(&amp;quot;https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_test_set.csv&amp;quot;, skip = 20)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that the data is a tibble where the first twenty rows are a mix of rows containing some text description and empty rows, while the 21st row contains the column names. That is why we have set the &lt;strong&gt;skip&lt;/strong&gt; argument equal to &lt;strong&gt;20&lt;/strong&gt;; the 21st row is then read as column names by default (&lt;code&gt;col_names = TRUE&lt;/code&gt;).&lt;/p&gt;
&lt;div id=&#34;summary-of-the-variables&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Summary of the variables&lt;/h3&gt;
&lt;p&gt;First let’s check the dimension of the two sets to be aware of what we are dealing with.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dim(train)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 60000   171&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dim(test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 16000   171&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The training set has 60000 rows and 171 variables, which is a moderately large dataset. Inspecting this data with the usual functions such as &lt;strong&gt;summary&lt;/strong&gt; or &lt;strong&gt;str&lt;/strong&gt; would thus give heavy, not easily readable output. A better alternative is to extract, in an aggregated way, only the most important information required for building a machine learning model, for instance the variable types, some statistics about the variable values, the missing values, etc.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map_chr(train, typeof) %&amp;gt;% 
  tibble() %&amp;gt;% 
  table()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;.
character    double 
      170         1 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Strangely, all the variables but one are characters, which contradicts the description of this data in the description file. To figure out what is going on, we display a few rows and a few columns.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train[1:5,1:7]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 5 x 7
  class aa_000 ab_000 ac_000     ad_000 ae_000 af_000
  &amp;lt;chr&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;chr&amp;gt;      &amp;lt;chr&amp;gt;  &amp;lt;chr&amp;gt;  &amp;lt;chr&amp;gt; 
1 neg    76698 na     2130706438 280    0      0     
2 neg    33058 na     0          na     0      0     
3 neg    41040 na     228        100    0      0     
4 neg       12 0      70         66     0      10    
5 neg    60874 na     1368       458    0      0     &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I think the problem is that the missing values in the data, indicated by &lt;strong&gt;na&lt;/strong&gt;, are not recognized as missing values; instead they are treated as characters, and this is what makes the function &lt;strong&gt;read_csv&lt;/strong&gt; coerce every variable that contains these &lt;strong&gt;na&lt;/strong&gt; values to character type. To fix this problem we can either go back and set the &lt;strong&gt;na&lt;/strong&gt; argument to &lt;strong&gt;“na”&lt;/strong&gt;, or we can set the missing values by hand as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train[-1] &amp;lt;- train[-1] %&amp;gt;% 
  modify(~replace(., .==&amp;quot;na&amp;quot;, NA)) %&amp;gt;%
  modify(., as.double)&lt;/code&gt;&lt;/pre&gt;
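&lt;p&gt;As a side note, the first option, declaring the missing-value token at read time, would be a short sketch along these lines (using the same URL as above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# read_csv recognizes &amp;quot;na&amp;quot; as a missing value directly,
# so the numeric columns keep their double type from the start
train &amp;lt;- read_csv(&amp;quot;https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_training_set.csv&amp;quot;,
                  skip = 20, na = c(&amp;quot;na&amp;quot;, &amp;quot;NA&amp;quot;))&lt;/code&gt;&lt;/pre&gt;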
&lt;p&gt;Now let’s check again&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map_chr(train, typeof) %&amp;gt;% 
  tibble() %&amp;gt;% 
  table()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;.
character    double 
        1       170 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first column excluded above is our target variable &lt;strong&gt;class&lt;/strong&gt;. We should not forget to do the same transformation to the test set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test[-1] &amp;lt;- test[-1] %&amp;gt;% 
  modify(~replace(., .==&amp;quot;na&amp;quot;, NA)) %&amp;gt;%
  modify(., as.double)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we tried to apply the &lt;strong&gt;summary&lt;/strong&gt; function to all 170 variables, we would spend a lot of time reading the summary of each variable without much gain. Instead, we look for an automated way to get only the information needed to build our model efficiently. To decide whether we should normalize the data or not, for instance, we display the standard deviations of all the variables in decreasing order.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: with tree-based models we need neither to normalize the data nor to convert factors to dummies.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map_dbl(train[-1], sd, na.rm=TRUE) %&amp;gt;% 
  tibble(sd = .) %&amp;gt;% 
  arrange(-sd)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 170 x 1
           sd
        &amp;lt;dbl&amp;gt;
 1 794874918.
 2  97484780.
 3  42746746.
 4  40404413.
 5  40404412.
 6  40404411.
 7  11567771.
 8  10886737.
 9  10859905.
10  10859904.
# ... with 160 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We have very large variability, which means that the data should be normalized for any machine learning model that uses gradient descent or is based on distances between observations.&lt;/p&gt;
&lt;p&gt;Another thing we can check is whether some variables have a small number of unique values and can hence be converted to factor type.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map(train[-1], unique) %&amp;gt;% 
  lengths(.) %&amp;gt;% 
  sort(.) %&amp;gt;% 
  head(5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;cd_000 ch_000 as_000 ef_000 ab_000 
     2      3     22     29     30 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To keep things simple we consider only the first two as candidates for conversion to factor type.&lt;/p&gt;
&lt;p&gt;The first one is constant, a zero-variance predictor since its variance equals zero, and the second one should be converted to a factor with two levels (in both sets); but since it has many missing values we will decide about it later on. Notice that we do not apply these transformations here, because they will be combined at once with all the other required transformations, as will be shown shortly.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;missing-values&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Missing values&lt;/h3&gt;
&lt;p&gt;The best way to deal with missing values depends on their number compared to the dataset size. If we have a small number, it is easier to simply remove the affected rows from the data; if, in contrast, we have a large number, the best choice is to impute them using one of the common methods designed for this kind of issue.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dim(train[!complete.cases(train),])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 59409   171&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, almost every row contains at least one missing value in some column. Let’s check the distribution of missing values within columns.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df &amp;lt;- modify(train[-1], is.na) %&amp;gt;% 
  colSums() %&amp;gt;%
  tibble(names = colnames(train[-1]),missing_values=.) %&amp;gt;% 
  arrange(-missing_values)
  
df&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 170 x 2
   names  missing_values
   &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt;
 1 br_000          49264
 2 bq_000          48722
 3 bp_000          47740
 4 bo_000          46333
 5 ab_000          46329
 6 cr_000          46329
 7 bn_000          44009
 8 bm_000          39549
 9 bl_000          27277
10 bk_000          23034
# ... with 160 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I think the best strategy is to first remove the columns that have a large number of missing values and then impute the rest; thereby we reduce the number of predictors and the number of missing values at once. The following script keeps the predictors that have fewer than &lt;strong&gt;10000&lt;/strong&gt; missing values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;names &amp;lt;- modify(train[-1], is.na) %&amp;gt;% 
  colSums() %&amp;gt;%
  tibble(names = colnames(train[-1]), missing_values=.) %&amp;gt;% 
  filter(missing_values &amp;lt; 10000) 
train1 &amp;lt;- train[c(&amp;quot;class&amp;quot;,names$names)]
test1 &amp;lt;- test[c(&amp;quot;class&amp;quot;,names$names)]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An important thing to note here is that if we use imputation methods that rely on information from other columns and/or rows to predict the current missing value, then the data must first be split into training and testing sets before any imputation, to abide by the crucial rule of machine learning: the test data should never be seen by the model during the training process.
Fortunately, our data is already split, so the imputation can be done separately. However, the imputation methods will be implemented later on with the help of the &lt;strong&gt;recipes&lt;/strong&gt; package, where we bundle all the pre-processing steps together.
&lt;strong&gt;Note&lt;/strong&gt;: the variable &lt;strong&gt;ch_000&lt;/strong&gt; mentioned above was removed since it did not meet the required threshold.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;imbalanced-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Imbalanced data&lt;/h3&gt;
&lt;p&gt;Another important issue that we face when predicting this data is the &lt;strong&gt;imbalanced&lt;/strong&gt; problem.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prop.table(table(train1$class))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
       neg        pos 
0.98333333 0.01666667 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This data is highly imbalanced, which tends to make even the worst machine learning model give a very high accuracy rate. In other words, if we do not use any model at all and predict every observation as the largest class (in our case negative), the accuracy rate will be approximately equal to the proportion of the largest class (in our case about 98%), which is a highly misleading result. Moreover, this misleading result can be catastrophic if we are more interested in predicting the small class (in our case positive), such as when detecting fraudulent credit cards. If you would like more detail about how to deal with imbalanced data, please check this &lt;a href=&#34;https://modelingwithr.rbind.io/post/methods-to-deal-with-imbalanced-data/&#34;&gt;article&lt;/a&gt;.&lt;/p&gt;
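&lt;p&gt;To make this concrete, we can compute that no-information baseline directly: a naive rule that predicts every observation as &lt;strong&gt;neg&lt;/strong&gt; without any model already matches the majority-class proportion shown above.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# accuracy of a dummy classifier that always predicts the majority class
mean(train1$class == &amp;quot;neg&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 0.9833333&lt;/code&gt;&lt;/pre&gt;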
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;building-the-recipe&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Building the recipe&lt;/h2&gt;
&lt;p&gt;Our initial model will be the &lt;strong&gt;random forest&lt;/strong&gt;, which is one of the most popular models. The first step in building our model is to define it together with the &lt;strong&gt;engine&lt;/strong&gt;, which is the method (or the package) used to fit the model, and the &lt;strong&gt;mode&lt;/strong&gt;, with two possible values, &lt;strong&gt;classification&lt;/strong&gt; or &lt;strong&gt;regression&lt;/strong&gt;. In our case, for instance, there are two available engines: &lt;strong&gt;randomForest&lt;/strong&gt; or &lt;strong&gt;ranger&lt;/strong&gt;. Notice that it is the &lt;strong&gt;parsnip&lt;/strong&gt; package that provides these settings. For more detail about all the available models click &lt;a href=&#34;https://cran.r-project.org/web/packages/parsnip/parsnip.pdf&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: To speed up the computation process we restrict the forest to &lt;strong&gt;100&lt;/strong&gt; trees instead of the default &lt;strong&gt;500&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf &amp;lt;- rand_forest(trees = 100) %&amp;gt;% 
  set_engine(&amp;quot;ranger&amp;quot;, num.threads=3, seed = 123) %&amp;gt;%
  set_mode(&amp;quot;classification&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Most machine learning models require pre-processed data with some feature engineering. Traditionally, base R (along with packages such as dplyr and stringr) provides a wide range of functions for almost every kind of feature engineering. However, if we have many different transformations to perform, they have to be done separately, and it becomes a little cumbersome to repeat the same scripts, for the testing set for instance. The &lt;strong&gt;recipes&lt;/strong&gt; package therefore provides an easy way to combine all the transformations, together with other model-related features (such as selecting the predictors to include, identifiers, etc.), into a single block that can then be applied to any other subset of the data.&lt;/p&gt;
&lt;p&gt;For our case we will apply the following transformations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Imputing the missing values with the median of the corresponding variable, since we have only numeric variables (for simplicity).&lt;/li&gt;
&lt;li&gt;Removing variables that have zero variance (variables with a single unique value).&lt;/li&gt;
&lt;li&gt;Removing highly correlated predictors, using the threshold &lt;strong&gt;0.8&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Normalizing the data (even though we do not need it for this model, we add this step since the recipe will be reused with other models that rely on gradient descent or distance calculations).&lt;/li&gt;
&lt;li&gt;Using the subsampling method &lt;strong&gt;smote&lt;/strong&gt; to create balanced data.
Notice that the &lt;strong&gt;smote&lt;/strong&gt; method is provided by the package &lt;strong&gt;themis&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To combine all these operations together we call the function &lt;strong&gt;recipe&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data_rec &amp;lt;- recipe(class~., data=train1) %&amp;gt;% 
  step_medianimpute(all_predictors() , seed_val = 111) %&amp;gt;% 
  step_zv(all_predictors()) %&amp;gt;% 
  step_corr(all_predictors(), threshold = 0.8) %&amp;gt;% 
  step_normalize(all_predictors()) %&amp;gt;%
  step_smote(class) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, everything is combined nicely and elegantly. However, this recipe has not transformed anything yet; it has only recorded the formula, the predictors, and the transformations that should be applied. This means that we can still update the formula or add and remove steps at any time before fitting our model. The really interesting feature of a recipe is that we can apply it to any other data (than that mentioned above, train) provided it has the same variable names. If you want to apply these transformations to the training data, use the &lt;strong&gt;prep&lt;/strong&gt; function and retrieve the results with the function &lt;strong&gt;juice&lt;/strong&gt;; for other data, use &lt;strong&gt;bake&lt;/strong&gt; after &lt;strong&gt;prep&lt;/strong&gt; so that some parameters are taken from the training data. For instance, when we normalize the data, this lets us use the means of the predictors computed from the training data rather than from the testing data. In our case, however, we will keep everything bundled together until the model fitting step.&lt;br /&gt;
For more detail about all the available steps click &lt;a href=&#34;https://cran.r-project.org/web/packages/recipes/recipes.pdf&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;building-the-workflow&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Building the workflow&lt;/h2&gt;
&lt;p&gt;To organize our workflow in a structured and smooth way, we use the &lt;strong&gt;workflows&lt;/strong&gt; package, which is part of the tidymodels collection.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(rf) %&amp;gt;% 
  add_recipe(data_rec)
rf_wf&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;== Workflow =======================
Preprocessor: Recipe
Model: rand_forest()

-- Preprocessor -------------------
5 Recipe Steps

* step_medianimpute()
* step_zv()
* step_corr()
* step_normalize()
* step_smote()

-- Model --------------------------
Random Forest Model Specification (classification)

Main Arguments:
  trees = 100

Engine-Specific Arguments:
  num.threads = 3
  seed = 123

Computational engine: ranger &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;random-forest-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Random forest model&lt;/h2&gt;
&lt;p&gt;Now we can run everything at once, the recipe and the model. Notice that here we can still update, add, or remove some elements before going ahead and fitting the model.&lt;/p&gt;
&lt;div id=&#34;model-training&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Model training&lt;/h3&gt;
&lt;p&gt;Everything is now ready to run our model with the default values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_rf &amp;lt;- rf_wf %&amp;gt;% 
  fit(data = train1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can extract the summary of this model as follows&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_rf %&amp;gt;% pull_workflow_fit()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;parsnip model object

Fit time:  55.7s 
Ranger result

Call:
 ranger::ranger(formula = ..y ~ ., data = data, num.trees = ~100,      num.threads = ~3, seed = ~123, verbose = FALSE, probability = TRUE) 

Type:                             Probability estimation 
Number of trees:                  100 
Sample size:                      118000 
Number of independent variables:  95 
Mtry:                             9 
Target node size:                 10 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error (Brier s.):  0.003998112 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This model has grown 100 trees and randomly chosen 9 predictors at each split. With these settings we obtain a very low OOB error rate of about 0.4% (an accuracy rate of about 99.6%). However, be cautious with such a high accuracy rate since, in practice, this result may be strongly related to an overfitting problem. The last thing I want to mention about this output is that the sample size (118000, almost twice the original 60000 rows) shows that we now have balanced data.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;model-evaluation&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Model evaluation&lt;/h3&gt;
&lt;p&gt;The best way to evaluate our model is by using the testing set. Notice that &lt;strong&gt;yardstick&lt;/strong&gt; provides a bunch of metrics to use, but let’s start with the most popular one for classification problems, &lt;strong&gt;accuracy&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_rf %&amp;gt;% 
  predict( new_data = test1) %&amp;gt;% 
  bind_cols(test1[&amp;quot;class&amp;quot;]) %&amp;gt;% 
  accuracy(truth= as.factor(class), .pred_class) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy binary         0.990&lt;/code&gt;&lt;/pre&gt;
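&lt;p&gt;Since accuracy alone can be deceiving here, yardstick can also report class-wise metrics; a sketch using &lt;strong&gt;metric_set&lt;/strong&gt; to combine accuracy, sensitivity, and specificity in one call (by default yardstick treats the first factor level, here neg, as the event of interest, just as caret does):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# bind predictions to the true classes, then compute several metrics at once
preds &amp;lt;- model_rf %&amp;gt;% 
  predict(new_data = test1) %&amp;gt;% 
  bind_cols(test1[&amp;quot;class&amp;quot;]) %&amp;gt;% 
  mutate(class = as.factor(class))
multi_metric &amp;lt;- metric_set(accuracy, sens, spec)
multi_metric(preds, truth = class, estimate = .pred_class)&lt;/code&gt;&lt;/pre&gt;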
&lt;p&gt;With this model we get a high accuracy rate that is very close to the previous one. However, we should not forget that we are dealing with imbalanced data, and even though we have used a subsampling method (the smote method here), such methods do not completely solve the issue; they can only reduce it to a certain level, which is the reason why so many of them exist. It is therefore better to use the confusion matrix from the &lt;strong&gt;caret&lt;/strong&gt; package, since it gives more information.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;caret::confusionMatrix(as.factor(test1$class), predict(model_rf, new_data = test1)$.pred_class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix and Statistics

          Reference
Prediction   neg   pos
       neg 15532    93
       pos    64   311
                                          
               Accuracy : 0.9902          
                 95% CI : (0.9885, 0.9917)
    No Information Rate : 0.9748          
    P-Value [Acc &amp;gt; NIR] : &amp;lt; 2e-16         
                                          
                  Kappa : 0.7934          
                                          
 Mcnemar&amp;#39;s Test P-Value : 0.02544         
                                          
            Sensitivity : 0.9959          
            Specificity : 0.7698          
         Pos Pred Value : 0.9940          
         Neg Pred Value : 0.8293          
             Prevalence : 0.9748          
         Detection Rate : 0.9708          
   Detection Prevalence : 0.9766          
      Balanced Accuracy : 0.8828          
                                          
       &amp;#39;Positive&amp;#39; Class : neg             
                                          &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As mentioned above, the specificity rate related to the minority class, about &lt;strong&gt;77%&lt;/strong&gt;, is very low compared to the sensitivity of the majority class, about &lt;strong&gt;99.6%&lt;/strong&gt;, and you can think of this as a partial overfitting towards the majority class. So if we are more interested in the minority class (which is often the case), then we have to go back and try tuning our model, or try another subsampling method.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;model-tuning&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Model tuning&lt;/h3&gt;
&lt;p&gt;For model tuning we try other values for some arguments rather than the default values, and leave the tuning of some others to the &lt;strong&gt;dials&lt;/strong&gt; package. So let’s fix the following argument values:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;num.trees = 100. The default is 500.&lt;/li&gt;
&lt;li&gt;num.threads = 3. The default is 1.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And tune the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;mtry = tune(). The default is the square root of the number of variables.&lt;/li&gt;
&lt;li&gt;min_n = tune(). The default is 1.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;First, we define the model with these new arguments.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_tune &amp;lt;- rand_forest(trees= 100, mtry=tune(), min_n = tune()) %&amp;gt;%
  set_engine(&amp;quot;ranger&amp;quot;, num.threads=3, seed=123) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since in a grid search the two arguments mtry and min_n are data dependent, we should at least specify their ranges.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid &amp;lt;- grid_regular(mtry(range = c(9,15)), min_n(range = c(5,40)), levels = 3)
grid&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 9 x 2
   mtry min_n
  &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;
1     9     5
2    12     5
3    15     5
4     9    22
5    12    22
6    15    22
7     9    40
8    12    40
9    15    40&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By setting the levels equal to 3 we get 9 combinations, and hence 9 models will be trained.
The above recipe has steps that should not be repeated many times during tuning; we therefore apply the recipe to the training data once to get the transformed data, and we do not forget to apply it to the testing data as well.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train2 &amp;lt;- prep(data_rec) %&amp;gt;% 
  juice()
test2 &amp;lt;- prep(data_rec) %&amp;gt;% 
  bake(test1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To tune our model we use the cross-validation technique. Since we have a large data set, we use only 3 folds.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(111)
fold &amp;lt;- vfold_cv(train2, v = 3, strata = class)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we bundle the specified model with a simple formula (instead of the recipe, since the data is already transformed).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tune_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(model_tune) %&amp;gt;%
  add_formula(class~.)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To fit these models across the folds we use the &lt;strong&gt;tune_grid&lt;/strong&gt; function instead of &lt;strong&gt;fit&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tune_rf &amp;lt;- tune_wf %&amp;gt;% 
  tune_grid(resamples = fold, grid = grid)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For classification problems this function uses two metrics by default: accuracy and the area under the ROC curve. So we can extract the metric values as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results &amp;lt;- tune_rf %&amp;gt;% collect_metrics()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To get the best model we have to choose one of the two metrics, so let’s go ahead with the accuracy rate.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;best_param &amp;lt;- 
  tune_rf %&amp;gt;% select_best(metric = &amp;quot;accuracy&amp;quot;)
best_param&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
   mtry min_n .config
  &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;  
1    15     5 Model3 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can now finalize the workflow with the new parameter values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tune_wf2 &amp;lt;- tune_wf %&amp;gt;% 
  finalize_workflow(best_param)
tune_wf2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;== Workflow =======================
Preprocessor: Formula
Model: rand_forest()

-- Preprocessor -------------------
class ~ .

-- Model --------------------------
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 15
  trees = 100
  min_n = 5

Engine-Specific Arguments:
  num.threads = 3
  seed = 123

Computational engine: ranger &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we fit the model with the best parameter values to the entire training data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;best_model &amp;lt;- tune_wf2 %&amp;gt;% 
  fit(train2)
best_model&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;== Workflow [trained] =============
Preprocessor: Formula
Model: rand_forest()

-- Preprocessor -------------------
class ~ .

-- Model --------------------------
Ranger result

Call:
 ranger::ranger(formula = ..y ~ ., data = data, mtry = ~15L, num.trees = ~100,      min.node.size = ~5L, num.threads = ~3, seed = ~123, verbose = FALSE,      probability = TRUE) 

Type:                             Probability estimation 
Number of trees:                  100 
Sample size:                      118000 
Number of independent variables:  95 
Mtry:                             15 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error (Brier s.):  0.00359659 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s get the confusion matrix&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;caret::confusionMatrix(as.factor(test2$class), predict(best_model, new_data = test2)$.pred_class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix and Statistics

          Reference
Prediction   neg   pos
       neg 15538    87
       pos    67   308
                                          
               Accuracy : 0.9904          
                 95% CI : (0.9887, 0.9918)
    No Information Rate : 0.9753          
    P-Value [Acc &amp;gt; NIR] : &amp;lt;2e-16          
                                          
                  Kappa : 0.7951          
                                          
 Mcnemar&amp;#39;s Test P-Value : 0.1258          
                                          
            Sensitivity : 0.9957          
            Specificity : 0.7797          
         Pos Pred Value : 0.9944          
         Neg Pred Value : 0.8213          
             Prevalence : 0.9753          
         Detection Rate : 0.9711          
   Detection Prevalence : 0.9766          
      Balanced Accuracy : 0.8877          
                                          
       &amp;#39;Positive&amp;#39; Class : neg             
                                          &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, we get hardly any improvement in the specificity rate (from 0.770 to 0.780), so let’s try another subsampling method, say the &lt;strong&gt;rose&lt;/strong&gt; method.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_rose &amp;lt;- rand_forest(trees = 100, mtry=9, min_n = 5) %&amp;gt;% 
  set_engine(&amp;quot;ranger&amp;quot;, num.threads=3, seed = 123) %&amp;gt;%
  set_mode(&amp;quot;classification&amp;quot;)
data_rec2 &amp;lt;- recipe(class~., data=train1) %&amp;gt;% 
  step_medianimpute(all_predictors() , seed_val = 111) %&amp;gt;% 
  step_zv(all_predictors()) %&amp;gt;% 
  step_corr(all_predictors(), threshold = 0.8) %&amp;gt;% 
  step_normalize(all_predictors()) %&amp;gt;%
  step_rose(class) 
rf_rose_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(rf_rose) %&amp;gt;% 
  add_recipe(data_rec2)
model_rose_rf &amp;lt;- rf_rose_wf %&amp;gt;% 
  fit(data = train1)
caret::confusionMatrix(as.factor(test1$class), predict(model_rose_rf, new_data = test1)$.pred_class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix and Statistics

          Reference
Prediction   neg   pos
       neg 15522   103
       pos   140   235
                                          
               Accuracy : 0.9848          
                 95% CI : (0.9828, 0.9867)
    No Information Rate : 0.9789          
    P-Value [Acc &amp;gt; NIR] : 2.437e-08       
                                          
                  Kappa : 0.6514          
                                          
 Mcnemar&amp;#39;s Test P-Value : 0.02092         
                                          
            Sensitivity : 0.9911          
            Specificity : 0.6953          
         Pos Pred Value : 0.9934          
         Neg Pred Value : 0.6267          
             Prevalence : 0.9789          
         Detection Rate : 0.9701          
   Detection Prevalence : 0.9766          
      Balanced Accuracy : 0.8432          
                                          
       &amp;#39;Positive&amp;#39; Class : neg             
                                          &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The ROSE method performs much worse than the SMOTE method here, since the specificity rate has dropped to 69%.&lt;/p&gt;
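&lt;p&gt;The specificity reported by &lt;strong&gt;caret&lt;/strong&gt; can be recomputed directly from a cross-tabulation of the two factors; a minimal sketch, assuming the fitted model and test set from above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cm &amp;lt;- table(Prediction = as.factor(test1$class),
            Reference  = predict(model_rose_rf, new_data = test1)$.pred_class)
# with &amp;quot;neg&amp;quot; as the positive class, specificity = TN / (TN + FP)
cm[&amp;quot;pos&amp;quot;, &amp;quot;pos&amp;quot;] / sum(cm[, &amp;quot;pos&amp;quot;])&lt;/code&gt;&lt;/pre&gt;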
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;logistic-regression-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Logistic regression model&lt;/h2&gt;
&lt;p&gt;Logistic regression is another model for fitting data with a binary outcome. As before, we use the first recipe with the SMOTE method.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;logit &amp;lt;- logistic_reg() %&amp;gt;% 
  set_engine(&amp;quot;glm&amp;quot;) %&amp;gt;%
  set_mode(&amp;quot;classification&amp;quot;)

logit_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(logit) %&amp;gt;% 
  add_recipe(data_rec)

set.seed(123)
model_logit &amp;lt;- logit_wf %&amp;gt;% 
  fit(data = train1)

caret::confusionMatrix(as.factor(test1$class), predict(model_logit, new_data = test1)$.pred_class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix and Statistics

          Reference
Prediction   neg   pos
       neg 15327   298
       pos    59   316
                                          
               Accuracy : 0.9777          
                 95% CI : (0.9753, 0.9799)
    No Information Rate : 0.9616          
    P-Value [Acc &amp;gt; NIR] : &amp;lt; 2.2e-16       
                                          
                  Kappa : 0.6282          
                                          
 Mcnemar&amp;#39;s Test P-Value : &amp;lt; 2.2e-16       
                                          
            Sensitivity : 0.9962          
            Specificity : 0.5147          
         Pos Pred Value : 0.9809          
         Neg Pred Value : 0.8427          
             Prevalence : 0.9616          
         Detection Rate : 0.9579          
   Detection Prevalence : 0.9766          
      Balanced Accuracy : 0.7554          
                                          
       &amp;#39;Positive&amp;#39; Class : neg             
                                          &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this model we do not get a better rate for the minority class than with the random forest model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;session-information&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Session information&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] yardstick_0.0.7  workflows_0.2.0  tune_0.1.1       tidyr_1.1.2     
 [5] tibble_3.0.3     rsample_0.0.8    purrr_0.3.4      parsnip_0.1.3   
 [9] modeldata_0.0.2  infer_0.5.3      dials_0.0.9      scales_1.1.1    
[13] broom_0.7.1      tidymodels_0.1.1 themis_0.1.2     recipes_0.1.13  
[17] dplyr_1.0.2      caret_6.0-86     ggplot2_3.3.2    lattice_0.20-41 
[21] readr_1.3.1     

loaded via a namespace (and not attached):
 [1] nlme_3.1-149         lubridate_1.7.9      doParallel_1.0.15   
 [4] DiceDesign_1.8-1     tools_4.0.1          backports_1.1.10    
 [7] utf8_1.1.4           R6_2.4.1             rpart_4.1-15        
[10] colorspace_1.4-1     nnet_7.3-14          withr_2.3.0         
[13] prettyunits_1.1.1    tidyselect_1.1.0     curl_4.3            
[16] compiler_4.0.1       parallelMap_1.5.0    cli_2.0.2           
[19] bookdown_0.20        checkmate_2.0.0      stringr_1.4.0       
[22] digest_0.6.25        rmarkdown_2.4        unbalanced_2.0      
[25] pkgconfig_2.0.3      htmltools_0.5.0      lhs_1.1.0           
[28] rlang_0.4.7          rstudioapi_0.11      BBmisc_1.11         
[31] FNN_1.1.3            generics_0.0.2       ModelMetrics_1.2.2.2
[34] magrittr_1.5         ROSE_0.0-3           Matrix_1.2-18       
[37] fansi_0.4.1          Rcpp_1.0.5           munsell_0.5.0       
[40] GPfit_1.0-8          lifecycle_0.2.0      furrr_0.1.0         
[43] stringi_1.5.3        pROC_1.16.2          yaml_2.2.1          
[46] MASS_7.3-53          plyr_1.8.6           grid_4.0.1          
[49] parallel_4.0.1       listenv_0.8.0        crayon_1.3.4        
[52] splines_4.0.1        hms_0.5.3            knitr_1.30          
[55] mlr_2.17.1           pillar_1.4.6         ranger_0.12.1       
[58] reshape2_1.4.4       codetools_0.2-16     stats4_4.0.1        
[61] fastmatch_1.1-0      glue_1.4.2           evaluate_0.14       
[64] ParamHelpers_1.14    blogdown_0.20        data.table_1.13.0   
[67] vctrs_0.3.4          foreach_1.5.0        gtable_0.3.0        
[70] RANN_2.6.1           future_1.19.1        assertthat_0.2.1    
[73] xfun_0.18            gower_0.2.2          prodlim_2019.11.13  
[76] e1071_1.7-3          class_7.3-17         survival_3.2-7      
[79] timeDate_3043.102    iterators_1.0.12     hardhat_0.1.4       
[82] lava_1.6.8           globals_0.13.0       ellipsis_0.3.1      
[85] ipred_0.9-9         &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Count data Models</title>
      <link>https://modelingwithr.rbind.io/post/count_data/count-data-models/</link>
      <pubDate>Mon, 06 Jan 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/post/count_data/count-data-models/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-partition&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Data partition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#poisson-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Poisson model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#quasi-poisson-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Quasi poisson model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#negative-binomial-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Negative binomial model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#hurdle-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Hurdle model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#hurdle-model-with-poisson-distribution.&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7.1&lt;/span&gt; hurdle model with poisson distribution.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#hurdle-model-with-negative-binomial-distribution.&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7.2&lt;/span&gt; hurdle model with negative binomial distribution.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#zero-inflated-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8&lt;/span&gt; Zero inflated model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#zero-inflated-model-with-poisson-distribution&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8.1&lt;/span&gt; Zero inflated model with poisson distribution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#zero-inflated-model-with-negative-binomial-distribution&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8.2&lt;/span&gt; Zero inflated model with negative binomial distribution&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;9&lt;/span&gt; Conclusion:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#furhter-reading&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;10&lt;/span&gt; Further reading:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-info&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;11&lt;/span&gt; Session info&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction:&lt;/h1&gt;
&lt;p&gt;When we deal with data that has a response variable of integer type, using a linear regression may violate the normality assumption, and hence all the classical statistical tests would fail to evaluate the model. However, as with logistic regression models, the generalized linear model &lt;a href=&#34;https://en.wikipedia.org/wiki/Generalized_linear_model&#34;&gt;GLM&lt;/a&gt; can be used here instead by specifying a suitable distribution.&lt;/p&gt;
&lt;p&gt;The candidate distributions for this type of data are the discrete distributions &lt;a href=&#34;https://en.wikipedia.org/wiki/Poisson_distribution&#34;&gt;poisson&lt;/a&gt; and &lt;a href=&#34;https://en.wikipedia.org/wiki/Negative_binomial_distribution&#34;&gt;negative binomial&lt;/a&gt;. The former is the best choice if the mean and the variance of the response variable are close to each other; if they are not and we persist in using this distribution, we risk an &lt;a href=&#34;https://en.wikipedia.org/wiki/Overdispersion&#34;&gt;overdispersion&lt;/a&gt; problem in the residuals. As a remedy, we can use the latter distribution, which does not have this restriction.&lt;/p&gt;
&lt;p&gt;There is another alternative, called the &lt;a href=&#34;https://en.wikipedia.org/wiki/Quasi-maximum_likelihood_estimate&#34;&gt;Quasi maximum likelihood&lt;/a&gt;, if neither the poisson distribution nor the negative binomial is suitable. The advantage of this method is that it uses only the relationship between the mean and the variance and does not require any prespecified distribution. Moreover, its estimators are approximately as efficient as the maximum likelihood estimators.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;To understand well how to model count data, we are going to use the &lt;strong&gt;Doctorvisits&lt;/strong&gt; data from the &lt;strong&gt;AER&lt;/strong&gt; package, in which the variable &lt;strong&gt;visits&lt;/strong&gt; will be our target variable, so let’s load this data along with the packages that we need throughout this article.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh &amp;lt;- suppressPackageStartupMessages
ssh(library(performance))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;performance&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh(library(ModelMetrics))
ssh(library(corrr))
ssh(library(purrr))
ssh(library(MASS))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;MASS&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh(library(tidyverse))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;ggplot2&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;tibble&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;tidyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;dplyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh(library(AER))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;car&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;lmtest&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;sandwich&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;survival&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh(library(broom))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;broom&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(&amp;quot;DoctorVisits&amp;quot;)
doc &amp;lt;- DoctorVisits
glimpse(doc)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Rows: 5,190
Columns: 12
$ visits    &amp;lt;dbl&amp;gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, ...
$ gender    &amp;lt;fct&amp;gt; female, female, male, male, male, female, female, female,...
$ age       &amp;lt;dbl&amp;gt; 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.1...
$ income    &amp;lt;dbl&amp;gt; 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.1...
$ illness   &amp;lt;dbl&amp;gt; 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, ...
$ reduced   &amp;lt;dbl&amp;gt; 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0,...
$ health    &amp;lt;dbl&amp;gt; 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, ...
$ private   &amp;lt;fct&amp;gt; yes, yes, no, no, no, no, no, no, yes, yes, no, no, no, n...
$ freepoor  &amp;lt;fct&amp;gt; no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...
$ freerepat &amp;lt;fct&amp;gt; no, no, no, no, no, no, no, no, no, no, no, yes, no, no, ...
$ nchronic  &amp;lt;fct&amp;gt; no, no, no, no, yes, yes, no, no, no, no, no, no, yes, ye...
$ lchronic  &amp;lt;fct&amp;gt; no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This data comes from an Australian health survey, where &lt;strong&gt;visits&lt;/strong&gt; is the number of doctor visits in the past two weeks, together with the 11 features listed above.&lt;/p&gt;
&lt;p&gt;First we display the summary of the data to inspect any unwanted issues.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(doc)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;     visits          gender          age             income      
 Min.   :0.0000   male  :2488   Min.   :0.1900   Min.   :0.0000  
 1st Qu.:0.0000   female:2702   1st Qu.:0.2200   1st Qu.:0.2500  
 Median :0.0000                 Median :0.3200   Median :0.5500  
 Mean   :0.3017                 Mean   :0.4064   Mean   :0.5832  
 3rd Qu.:0.0000                 3rd Qu.:0.6200   3rd Qu.:0.9000  
 Max.   :9.0000                 Max.   :0.7200   Max.   :1.5000  
    illness         reduced            health       private    freepoor  
 Min.   :0.000   Min.   : 0.0000   Min.   : 0.000   no :2892   no :4968  
 1st Qu.:0.000   1st Qu.: 0.0000   1st Qu.: 0.000   yes:2298   yes: 222  
 Median :1.000   Median : 0.0000   Median : 0.000                        
 Mean   :1.432   Mean   : 0.8619   Mean   : 1.218                        
 3rd Qu.:2.000   3rd Qu.: 0.0000   3rd Qu.: 2.000                        
 Max.   :5.000   Max.   :14.0000   Max.   :12.000                        
 freerepat  nchronic   lchronic  
 no :4099   no :3098   no :4585  
 yes:1091   yes:2092   yes: 605  
                                 
                                 
                                 
                                 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see we do not have missing values, and the visits values range from 0 to 9 but should be of integer type rather than double. Similarly, the variable &lt;strong&gt;illness&lt;/strong&gt; should be converted to factor type since it has only a few distinct values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;doc$visits&amp;lt;-as.integer(doc$visits)
doc$illness &amp;lt;- as.factor(doc$illness)
tab &amp;lt;- table(doc$visits)
tab&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
   0    1    2    3    4    5    6    7    8    9 
4141  782  174   30   24    9   12   12    5    1 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The best way to start analyzing the data is to display the &lt;strong&gt;correlation coefficient&lt;/strong&gt; for each pair of variables. Any predictor that has a high correlation with the target variable is likely to be relevant in our future model. Notice that our target variable is not continuous, hence we will use the 
&lt;a href=&#34;https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient&#34;&gt;spearman correlation&lt;/a&gt;. As required by the &lt;strong&gt;correlate&lt;/strong&gt; function from the &lt;strong&gt;corrr&lt;/strong&gt; package, all the variables must be of numeric type, so we convert all the factors to integers.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;doc1 &amp;lt;-modify_if(doc, is.factor, as.integer)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that we have stored the result in a new object, &lt;strong&gt;doc1&lt;/strong&gt;, to keep our original data intact.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;M &amp;lt;- correlate(doc1, method=&amp;quot;spearman&amp;quot;)
rplot(shave(M), colours=c(&amp;quot;red&amp;quot;, &amp;quot;white&amp;quot;, &amp;quot;blue&amp;quot; ))+
   theme(axis.text.x = element_text(angle = 90, hjust = 1))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/count_data/2020-01-06-count-data-models_files/figure-html/unnamed-chunk-6-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Looking at this plot, all the correlations have low values. However, these correlations assess only monotonic relations; they say nothing about any other form of relationship.&lt;br /&gt;
First let’s compare the empirical distribution of the variable &lt;strong&gt;visits&lt;/strong&gt; with the theoretical poisson distribution with &lt;span class=&#34;math inline&#34;&gt;\(\lambda\)&lt;/span&gt; equal to the visits mean 0.3017341, given the total number of observations 5190.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pos &amp;lt;- dpois(0:9,0.302)*5190
both &amp;lt;- numeric(20)
both[1:20 %% 2 != 0] &amp;lt;- tab
both[1:20 %% 2 == 0] &amp;lt;- pos
labels&amp;lt;-character(20)
labels[1:20 %% 2==0]&amp;lt;-as.character(0:9)
barplot(both,col=rep(c(&amp;quot;red&amp;quot;,&amp;quot;yellow&amp;quot;),10),names=labels)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/count_data/2020-01-06-count-data-models_files/figure-html/unnamed-chunk-7-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As we see, the two distributions are fairly close to each other.
Let’s now check the negative binomial distribution, first estimating the 
&lt;a href=&#34;https://influentialpoints.com/Training/negative_binomial_distribution-principles-properties-assumptions.htm&#34;&gt;clumping parameter&lt;/a&gt; &lt;span class=&#34;math inline&#34;&gt;\(k=\frac{\bar x^2}{s^2-\bar x}\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;k&amp;lt;-mean(doc$visits)^2/(var(doc$visits)-mean(doc$visits))
bin&amp;lt;-dnbinom(0:9,0.27,mu=0.302)*5190
both1&amp;lt;-numeric(20)
both1[1:20 %% 2 != 0]&amp;lt;-tab
both1[1:20 %% 2 == 0]&amp;lt;-bin
labels&amp;lt;-character(20)
labels[1:20 %% 2==0]&amp;lt;-as.character(0:9)
barplot(both1,col=rep(c(&amp;quot;red&amp;quot;,&amp;quot;yellow&amp;quot;),10),names=labels)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/count_data/2020-01-06-count-data-models_files/figure-html/unnamed-chunk-8-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;With this comparison it seems that the empirical distribution is closer to the negative binomial than to the poisson distribution.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This data has a very large number of zeros in the outcome compared to the other values, which means that any trained model that does not take this anomaly into account will be biased towards predicting the &lt;strong&gt;zero&lt;/strong&gt; value. At the end of this article I will show two well-known models for handling this type of count data, the &lt;strong&gt;Hurdle&lt;/strong&gt; model and the &lt;strong&gt;zero-inflated&lt;/strong&gt; model.&lt;/p&gt;
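&lt;p&gt;As a preview of those two models, here is a minimal sketch of the general call pattern, assuming the &lt;strong&gt;pscl&lt;/strong&gt; package implementation (the detailed fits come later in the article):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(pscl)
# hurdle: a binary part for zero vs positive, then a truncated count part
mod_hurdle &amp;lt;- hurdle(visits ~ ., data = doc, dist = &amp;quot;poisson&amp;quot;)
# zero-inflated: a mixture of a point mass at zero and a count distribution
mod_zinb &amp;lt;- zeroinfl(visits ~ ., data = doc, dist = &amp;quot;negbin&amp;quot;)&lt;/code&gt;&lt;/pre&gt;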
&lt;/div&gt;
&lt;div id=&#34;data-partition&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Data partition&lt;/h1&gt;
&lt;p&gt;In order to evaluate our model we hold out 20% of the data as a testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
index&amp;lt;-sample(2,nrow(doc),replace = TRUE,p=c(.8,.2))
train&amp;lt;-doc[index==1,]
test&amp;lt;-doc[index==2,]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;poisson-model&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Poisson model&lt;/h1&gt;
&lt;p&gt;This model belongs to the generalized linear model family, so in the function &lt;strong&gt;glm&lt;/strong&gt; we set the argument &lt;strong&gt;family&lt;/strong&gt; to poisson. In practice this model is sufficient for a wide range of count data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
model1&amp;lt;-glm(visits~., data=train, family =&amp;quot;poisson&amp;quot;)
tidy(model1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 16 x 5
   term         estimate std.error statistic   p.value
   &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;
 1 (Intercept)   -2.70     0.141     -19.2   9.14e- 82
 2 genderfemale   0.193    0.0620      3.11  1.88e-  3
 3 age            0.436    0.184       2.37  1.77e-  2
 4 income        -0.161    0.0928     -1.74  8.23e-  2
 5 illness1       0.944    0.113       8.35  6.76e- 17
 6 illness2       1.21     0.118      10.3   1.15e- 24
 7 illness3       1.11     0.132       8.43  3.51e- 17
 8 illness4       1.28     0.140       9.13  7.14e- 20
 9 illness5       1.44     0.139      10.4   2.34e- 25
10 reduced        0.126    0.00560    22.6   6.85e-113
11 health         0.0348   0.0112      3.10  1.91e-  3
12 privateyes     0.111    0.0795      1.39  1.64e-  1
13 freepooryes   -0.344    0.190      -1.81  7.00e-  2
14 freerepatyes   0.0377   0.104       0.363 7.16e-  1
15 nchronicyes    0.0186   0.0732      0.254 7.99e-  1
16 lchronicyes    0.0255   0.0916      0.279 7.81e-  1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Judging by the p-values, &lt;strong&gt;income&lt;/strong&gt; is not significant at the 5% level, so we remove this variable and re-estimate the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
model1&amp;lt;-glm(visits~.-income, data=train, family =&amp;quot;poisson&amp;quot;)
tidy(model1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 15 x 5
   term         estimate std.error statistic   p.value
   &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;
 1 (Intercept)   -2.83     0.121     -23.4   7.36e-121
 2 genderfemale   0.213    0.0609      3.51  4.56e-  4
 3 age            0.479    0.183       2.62  8.91e-  3
 4 illness1       0.946    0.113       8.38  5.44e- 17
 5 illness2       1.21     0.118      10.3   8.29e- 25
 6 illness3       1.12     0.132       8.50  1.93e- 17
 7 illness4       1.28     0.140       9.17  4.71e- 20
 8 illness5       1.45     0.139      10.5   1.05e- 25
 9 reduced        0.126    0.00560    22.6   1.12e-112
10 health         0.0350   0.0112      3.11  1.84e-  3
11 privateyes     0.100    0.0793      1.27  2.06e-  1
12 freepooryes   -0.290    0.188      -1.55  1.22e-  1
13 freerepatyes   0.0683   0.102       0.667 5.05e-  1
14 nchronicyes    0.0171   0.0731      0.235 8.15e-  1
15 lchronicyes    0.0282   0.0914      0.308 7.58e-  1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To interpret the coefficient estimates, we should exponentiate them to obtain multiplicative effects, since the poisson model uses the log link function to preclude negative fitted values. For a continuous predictor, say age, if the predictor increases by one unit, ceteris paribus, we expect the number of doctor visits to be &lt;span class=&#34;math inline&#34;&gt;\(exp(0.47876624)=1.614082\)&lt;/span&gt; times larger. For a categorical predictor, say gender, females have &lt;span class=&#34;math inline&#34;&gt;\(exp(0.21342446)=1.23791\)&lt;/span&gt; times as many doctor visits as males.&lt;br /&gt;
Judging by the p-values, most of the predictors are significant; however, we still have to check other statistics and metrics.&lt;/p&gt;
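&lt;p&gt;The exponentiated coefficients can be obtained in one step; a quick sketch:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# multiplicative effects on the expected number of visits
round(exp(coef(model1)), 3)&lt;/code&gt;&lt;/pre&gt;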
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;glance(model1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 8
  null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
          &amp;lt;dbl&amp;gt;   &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;       &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;
1         4565.    4154 -2685. 5399. 5494.    3486.        4140  4155&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Comparing the deviance value &lt;strong&gt;3485.905&lt;/strong&gt; with the degrees of freedom &lt;strong&gt;4140&lt;/strong&gt; gives a first hint about a possible &lt;strong&gt;overdispersion&lt;/strong&gt; problem.
Fortunately, the &lt;strong&gt;AER&lt;/strong&gt; package provides a very easy way to test the significance of this dispersion via the function &lt;strong&gt;dispersiontest&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dispersiontest(model1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
    Overdispersion test

data:  model1
z = 6.278, p-value = 1.714e-10
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion 
  1.397176 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If our target variable really follows a poisson distribution, then its variance &lt;span class=&#34;math inline&#34;&gt;\(V\)&lt;/span&gt; should be approximately equal to its mean &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt;, which is the null hypothesis of the &lt;strong&gt;dispersiontest&lt;/strong&gt; test against the alternative hypothesis that the variance is of the form:
&lt;span class=&#34;math display&#34;&gt;\[V=\mu+\alpha.trafo(\mu)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Where &lt;strong&gt;trafo&lt;/strong&gt; is a hyperparameter that should be specified as an argument of this test. The popular choices for this argument are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;trafo = NULL (default): &lt;span class=&#34;math inline&#34;&gt;\(V=(1+\alpha)\mu\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;trafo = 1: &lt;span class=&#34;math inline&#34;&gt;\(V=\mu+\alpha.\mu\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;trafo = 2: &lt;span class=&#34;math inline&#34;&gt;\(V=\mu+\alpha.\mu^2\)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the first form holds, the data will be better modeled by a quasi-poisson model than by a poisson model.
If one of the last two holds, the negative binomial will be better than the poisson model.&lt;br /&gt;
Once the trafo is defined, the test estimates &lt;span class=&#34;math inline&#34;&gt;\(\alpha\)&lt;/span&gt;, such that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;if &lt;span class=&#34;math inline&#34;&gt;\(\alpha = 0\)&lt;/span&gt; : equidispersion (The null hypothesis)&lt;/li&gt;
&lt;li&gt;if &lt;span class=&#34;math inline&#34;&gt;\(\alpha &amp;lt; 0\)&lt;/span&gt; : underdispersion&lt;/li&gt;
&lt;li&gt;if &lt;span class=&#34;math inline&#34;&gt;\(\alpha &amp;gt; 0\)&lt;/span&gt; : overdispersion&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Therefore, the result depends on the direction of the test: &lt;strong&gt;two.sided&lt;/strong&gt;, &lt;strong&gt;greater&lt;/strong&gt; (the default) for overdispersion, or &lt;strong&gt;less&lt;/strong&gt; for underdispersion.&lt;/p&gt;
&lt;p&gt;With this in mind, the above test (with the default values) checked for overdispersion against the quasi-poisson alternative, and since the p-value &lt;strong&gt;1.714e-10&lt;/strong&gt; is very small, we do have an overdispersion problem, suggesting the use of a quasi-poisson model instead.&lt;/p&gt;
&lt;p&gt;Now let’s test the negative binomial alternatives.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dispersiontest(model1, trafo = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
    Overdispersion test

data:  model1
z = 6.278, p-value = 1.714e-10
alternative hypothesis: true alpha is greater than 0
sample estimates:
    alpha 
0.3971763 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The test suggests the use of the negative binomial with a linear variance function, with a very small p-value &lt;strong&gt;1.714e-10&lt;/strong&gt;. This model is known as NB1 (linear variance function).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dispersiontest(model1, trafo = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
    Overdispersion test

data:  model1
z = 7.4723, p-value = 3.939e-14
alternative hypothesis: true alpha is greater than 0
sample estimates:
  alpha 
0.95488 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the relation is quadratic, the model is called NB2. Since this p-value &lt;strong&gt;3.939e-14&lt;/strong&gt; is smaller than the previous one, NB2 may be more appropriate than NB1.&lt;/p&gt;
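&lt;p&gt;The NB2 specification corresponds to the classical negative binomial GLM, which can be fitted with &lt;strong&gt;glm.nb&lt;/strong&gt; from the &lt;strong&gt;MASS&lt;/strong&gt; package loaded earlier; a minimal sketch (the negative binomial model is explored in its own section):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_nb &amp;lt;- glm.nb(visits ~ . - income, data = train)
summary(model_nb)&lt;/code&gt;&lt;/pre&gt;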
&lt;/div&gt;
&lt;div id=&#34;quasi-poisson-model&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Quasi poisson model&lt;/h1&gt;
&lt;p&gt;The first test suggested the use of quasi-poisson model, so let’s train this model with the same predictors as the previous one.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
model2&amp;lt;-glm(visits~.-income, data=train, family =&amp;quot;quasipoisson&amp;quot;)
tidy(model2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 15 x 5
   term         estimate std.error statistic  p.value
   &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
 1 (Intercept)   -2.83     0.140     -20.2   1.25e-86
 2 genderfemale   0.213    0.0705      3.03  2.47e- 3
 3 age            0.479    0.212       2.26  2.39e- 2
 4 illness1       0.946    0.131       7.24  5.36e-13
 5 illness2       1.21     0.136       8.89  9.15e-19
 6 illness3       1.12     0.152       7.34  2.49e-13
 7 illness4       1.28     0.162       7.92  2.91e-15
 8 illness5       1.45     0.160       9.06  2.01e-19
 9 reduced        0.126    0.00647    19.5   4.79e-81
10 health         0.0350   0.0130      2.69  7.14e- 3
11 privateyes     0.100    0.0917      1.09  2.74e- 1
12 freepooryes   -0.290    0.217      -1.34  1.81e- 1
13 freerepatyes   0.0683   0.119       0.576 5.64e- 1
14 nchronicyes    0.0171   0.0846      0.203 8.39e- 1
15 lchronicyes    0.0282   0.106       0.266 7.90e- 1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This model uses quasi maximum likelihood, which gives the same coefficient estimates but with different (corrected) standard errors.&lt;br /&gt;
The two models are therefore identical except for the corrected standard errors, which are now larger. In other words, under overdispersion the poisson model underestimates the standard errors, so the &lt;strong&gt;t test&lt;/strong&gt; is biased towards rejecting the null hypothesis.
To better understand what is going on with the quasi-poisson model, let’s put the estimates and standard errors of both models into one table, adding a column obtained by dividing the second vector of standard errors by the first one.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;D1 &amp;lt;- tidy(model1)
colnames(D1) &amp;lt;- NULL
D2 &amp;lt;- tidy(model2)
colnames(D2) &amp;lt;- NULL
tibble(term=D1[[1]], estimate1=D1[[2]], std1=D1[[3]],estimate2=D2[[2]], std2=D2[[3]], dispersion= std2/std1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 15 x 6
   term         estimate1    std1 estimate2    std2 dispersion
   &amp;lt;chr&amp;gt;            &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;      &amp;lt;dbl&amp;gt;
 1 (Intercept)    -2.83   0.121     -2.83   0.140         1.16
 2 genderfemale    0.213  0.0609     0.213  0.0705        1.16
 3 age             0.479  0.183      0.479  0.212         1.16
 4 illness1        0.946  0.113      0.946  0.131         1.16
 5 illness2        1.21   0.118      1.21   0.136         1.16
 6 illness3        1.12   0.132      1.12   0.152         1.16
 7 illness4        1.28   0.140      1.28   0.162         1.16
 8 illness5        1.45   0.139      1.45   0.160         1.16
 9 reduced         0.126  0.00560    0.126  0.00647       1.16
10 health          0.0350 0.0112     0.0350 0.0130        1.16
11 privateyes      0.100  0.0793     0.100  0.0917        1.16
12 freepooryes    -0.290  0.188     -0.290  0.217         1.16
13 freerepatyes    0.0683 0.102      0.0683 0.119         1.16
14 nchronicyes     0.0171 0.0731     0.0171 0.0846        1.16
15 lchronicyes     0.0282 0.0914     0.0282 0.106         1.16&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The columns &lt;strong&gt;estimate1&lt;/strong&gt; and &lt;strong&gt;std1&lt;/strong&gt; belong to model1, and &lt;strong&gt;estimate2&lt;/strong&gt; and &lt;strong&gt;std2&lt;/strong&gt; to model2.
Not surprisingly, the last column is constant, since this is exactly what quasi maximum likelihood does: it computes the corrected standard errors from the original ones as &lt;span class=&#34;math inline&#34;&gt;\(std2=dispersion \times std1\)&lt;/span&gt;, with the dispersion estimated here as &lt;strong&gt;1.15718&lt;/strong&gt;. Where does this value come from? It is simply the standard deviation of the Pearson (standardized) residuals of the original model. We can recover it by setting the &lt;code&gt;type&lt;/code&gt; argument to &lt;strong&gt;pear&lt;/strong&gt; and computing sigma by hand as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;resid &amp;lt;- resid(model1, type = &amp;quot;pear&amp;quot;) # Pearson residuals
sqrt(sum(resid^2)/4140) # 4140 = residual degrees of freedom&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 1.15718&lt;/code&gt;&lt;/pre&gt;
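&lt;p&gt;The same mechanics can be checked end to end on simulated data (a self-contained sketch; the data and names here are made up): the quasi-poisson standard errors are exactly the poisson ones multiplied by the square root of the Pearson dispersion.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(42)
x &amp;lt;- rnorm(500)
y &amp;lt;- rnbinom(500, mu = exp(0.3 + 0.5 * x), size = 1) # overdispersed counts
mp  &amp;lt;- glm(y ~ x, family = poisson)
mqp &amp;lt;- glm(y ~ x, family = quasipoisson)
phi &amp;lt;- sum(residuals(mp, type = &amp;quot;pearson&amp;quot;)^2) / mp$df.residual
ratio &amp;lt;- summary(mqp)$coefficients[, 2] / summary(mp)$coefficients[, 2]
all.equal(unname(ratio), rep(sqrt(phi), 2)) # TRUE: both SEs scaled by sqrt(phi)&lt;/code&gt;&lt;/pre&gt;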
&lt;p&gt;Now, to assess the predictive quality of our models, we use the testing set &lt;strong&gt;test&lt;/strong&gt; and plot the original against the predicted values.
Let’s start with model1.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;- predict.glm(model1,newdata=test[test$visits!=0,],type = &amp;quot;response&amp;quot;)
plot(test$visits[test$visits!=0],type = &amp;quot;b&amp;quot;,col=&amp;quot;red&amp;quot;)
lines(round(pred),col=&amp;quot;blue&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/count_data/2020-01-06-count-data-models_files/figure-html/unnamed-chunk-19-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As you may have noticed, because of the large number of zeros in the target variable, I have intentionally removed all these values to get a clearer plot.
From this plot we can say that the model does not fit the data well; in particular, the larger values are not well captured. However, this may be due to the data being heavily skewed towards zero.&lt;/p&gt;
&lt;p&gt;To compare the different models we use the &lt;strong&gt;root mean square error&lt;/strong&gt; and the &lt;strong&gt;mean absolute error&lt;/strong&gt; (computed on all the data, zeros included).
&lt;strong&gt;Note&lt;/strong&gt;: Here we use the &lt;strong&gt;rmse&lt;/strong&gt; function from &lt;strong&gt;ModelMetrics&lt;/strong&gt;, which expects two vectors as input, not the function of the same name from the &lt;strong&gt;performance&lt;/strong&gt; package, which expects a model object. To avoid any ambiguity, call it as &lt;code&gt;ModelMetrics::rmse&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict.glm(model1, newdata = test, type = &amp;quot;response&amp;quot;)
rmsemodelp &amp;lt;- ModelMetrics::rmse(test$visits,round(pred))
maemodelp &amp;lt;- mae(test$visits,round(pred))
rmsemodelp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 0.7381921&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;maemodelp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 0.284058&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the same way, let’s now evaluate the quasi-poisson model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predq&amp;lt;- predict.glm(model2,newdata=test[test$visits!=0,],type = &amp;quot;response&amp;quot;)
plot(test$visits[test$visits!=0],type = &amp;quot;b&amp;quot;,col=&amp;quot;red&amp;quot;)
lines(round(predq),col=&amp;quot;blue&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/count_data/2020-01-06-count-data-models_files/figure-html/unnamed-chunk-21-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This plot does not look very different from the previous one.
The rmse and mae for this model are computed as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predq &amp;lt;- predict.glm(model2,newdata=test, type = &amp;quot;response&amp;quot;)
rmsemodelqp &amp;lt;- ModelMetrics::rmse(test$visits,round(predq))
maemodelqp &amp;lt;- mae(test$visits,round(predq))
rmsemodelqp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 0.7381921&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;maemodelqp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 0.284058&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will hold off on comparing these two models until all the remaining models are trained, and then compare them all at once.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;negative-binomial-model&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Negative binomial model&lt;/h1&gt;
&lt;p&gt;The negative binomial distribution is used as an alternative to the poisson distribution when overdispersion is present. We fit it with the &lt;strong&gt;glm.nb&lt;/strong&gt; function from the &lt;strong&gt;MASS&lt;/strong&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
model3&amp;lt;-glm.nb(visits~.-income, data=train)
summary(model3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
Call:
glm.nb(formula = visits ~ . - income, data = train, init.theta = 0.9715923611, 
    link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8413  -0.6894  -0.5335  -0.3540   3.6726  

Coefficients:
              Estimate Std. Error z value Pr(&amp;gt;|z|)    
(Intercept)  -2.894820   0.135760 -21.323  &amp;lt; 2e-16 ***
genderfemale  0.258968   0.075352   3.437 0.000589 ***
age           0.511867   0.230297   2.223 0.026240 *  
illness1      0.880644   0.123264   7.144 9.04e-13 ***
illness2      1.171615   0.130240   8.996  &amp;lt; 2e-16 ***
illness3      1.118067   0.149032   7.502 6.28e-14 ***
illness4      1.263367   0.165370   7.640 2.18e-14 ***
illness5      1.378166   0.169907   8.111 5.01e-16 ***
reduced       0.141389   0.008184  17.275  &amp;lt; 2e-16 ***
health        0.041364   0.015029   2.752 0.005918 ** 
privateyes    0.086188   0.095173   0.906 0.365149    
freepooryes  -0.375471   0.223857  -1.677 0.093487 .  
freerepatyes  0.144928   0.127751   1.134 0.256602    
nchronicyes   0.022111   0.087590   0.252 0.800705    
lchronicyes   0.091622   0.114965   0.797 0.425477    
---
Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1

(Dispersion parameter for Negative Binomial(0.9716) family taken to be 1)

    Null deviance: 3208.0  on 4154  degrees of freedom
Residual deviance: 2431.2  on 4140  degrees of freedom
AIC: 5159.6

Number of Fisher Scoring iterations: 1

              Theta:  0.972 
          Std. Err.:  0.103 

 2 x log-likelihood:  -5127.587 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As before we visualize the performance of this model as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prednb&amp;lt;- predict.glm(model3,newdata=test[test$visits!=0,],type = &amp;quot;response&amp;quot;)
plot(test$visits[test$visits!=0],type = &amp;quot;b&amp;quot;,col=&amp;quot;red&amp;quot;)
lines(round(prednb),col=&amp;quot;blue&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/count_data/2020-01-06-count-data-models_files/figure-html/unnamed-chunk-24-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Again, this plot looks much the same as the previous ones, so to figure out which model is best we rely on the numeric metrics.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prednb&amp;lt;- predict.glm(model3,newdata=test,type = &amp;quot;response&amp;quot;)
rmsemodelnb&amp;lt;-ModelMetrics::rmse(test$visits,round(prednb))
maemodelnb&amp;lt;-mae(test$visits,round(prednb))
knitr::kable(tibble(rms=rmsemodelnb,mae=maemodelnb))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;rms&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mae&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;0.7808085&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2966184&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We will use these outputs later in the model comparison.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;hurdle-model&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Hurdle model&lt;/h1&gt;
&lt;p&gt;Originally proposed by Mullahy (1986), this model accounts for the excess of zeros in the data and can also handle the overdispersion problem. It has two components (or steps): a truncated count component defined by a chosen discrete distribution such as the poisson or the negative binomial, and a hurdle component that models zero versus larger counts (using a censored count distribution or a binomial model). In other words, this model assumes that two population distributions underlie the data: one for the zero values and a different one for the positive values. For more details about hurdle and zero-inflated models click &lt;a href=&#34;https://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf#cite.countreg%3AZeileis%3A2006&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To fit this model we use the &lt;strong&gt;hurdle&lt;/strong&gt; function from the &lt;strong&gt;pscl&lt;/strong&gt; package.&lt;/p&gt;
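&lt;p&gt;The two-component structure can be illustrated on simulated data (a self-contained sketch with made-up names; the zero-truncated draws use inverse-CDF sampling):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(pscl)
set.seed(3)
n &amp;lt;- 500
x &amp;lt;- rnorm(n)
p_pos &amp;lt;- plogis(-0.5 + x)                   # hurdle step: P(y &amp;gt; 0)
lam &amp;lt;- exp(0.3)                             # mean of the truncated count step
u &amp;lt;- runif(n, min = dpois(0, lam), max = 1)
y_pos &amp;lt;- qpois(u, lam)                      # zero-truncated poisson draws (all &amp;gt;= 1)
y &amp;lt;- ifelse(rbinom(n, 1, p_pos) == 1, y_pos, 0)
h &amp;lt;- hurdle(y ~ x, dist = &amp;quot;poisson&amp;quot;)
coef(h, model = &amp;quot;zero&amp;quot;)  # binary hurdle component (logit)
coef(h, model = &amp;quot;count&amp;quot;) # truncated poisson component&lt;/code&gt;&lt;/pre&gt;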
&lt;div id=&#34;hurdle-model-with-poisson-distribution.&#34; class=&#34;section level2&#34; number=&#34;7.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;7.1&lt;/span&gt; hurdle model with poisson distribution.&lt;/h2&gt;
&lt;p&gt;This model works in two steps. In the first step, a binary classifier discriminates between zero and positive values; in the second step, a traditional count model (poisson or negative binomial; here we use the poisson) is fitted to the positive values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(pscl)
set.seed(123)
modelhp&amp;lt;-hurdle(visits~. -income, data=train,dist = &amp;quot;poisson&amp;quot;)
summary(modelhp)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
Call:
hurdle(formula = visits ~ . - income, data = train, dist = &amp;quot;poisson&amp;quot;)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.5464 -0.4686 -0.3306 -0.2075 11.0887 

Count model coefficients (truncated poisson with log link):
              Estimate Std. Error z value Pr(&amp;gt;|z|)    
(Intercept)  -0.977535   0.261835  -3.733 0.000189 ***
genderfemale  0.073326   0.098034   0.748 0.454480    
age           0.032762   0.287991   0.114 0.909427    
illness1      0.370071   0.251920   1.469 0.141833    
illness2      0.403514   0.256363   1.574 0.115489    
illness3      0.201724   0.278757   0.724 0.469277    
illness4      0.420285   0.277573   1.514 0.129990    
illness5      0.762209   0.269809   2.825 0.004728 ** 
reduced       0.111640   0.007967  14.013  &amp;lt; 2e-16 ***
health        0.007682   0.016452   0.467 0.640554    
privateyes   -0.215649   0.129860  -1.661 0.096787 .  
freepooryes   0.066277   0.269699   0.246 0.805879    
freerepatyes -0.434941   0.166196  -2.617 0.008870 ** 
nchronicyes   0.109660   0.125380   0.875 0.381779    
lchronicyes   0.135612   0.142766   0.950 0.342166    
Zero hurdle model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(&amp;gt;|z|)    
(Intercept)  -3.224621   0.156469 -20.609  &amp;lt; 2e-16 ***
genderfemale  0.305324   0.089648   3.406 0.000660 ***
age           0.700345   0.276089   2.537 0.011191 *  
illness1      0.885148   0.136951   6.463 1.02e-10 ***
illness2      1.238227   0.146059   8.478  &amp;lt; 2e-16 ***
illness3      1.263698   0.169344   7.462 8.50e-14 ***
illness4      1.405167   0.195388   7.192 6.40e-13 ***
illness5      1.445585   0.208425   6.936 4.04e-12 ***
reduced       0.154858   0.013488  11.481  &amp;lt; 2e-16 ***
health        0.070464   0.019142   3.681 0.000232 ***
privateyes    0.271192   0.112751   2.405 0.016163 *  
freepooryes  -0.546177   0.277942  -1.965 0.049406 *  
freerepatyes  0.423220   0.153994   2.748 0.005991 ** 
nchronicyes  -0.006256   0.102033  -0.061 0.951106    
lchronicyes   0.070658   0.140587   0.503 0.615251    
---
Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1 

Number of iterations in BFGS optimization: 22 
Log-likelihood: -2581 on 30 Df&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, this output has two tables. The upper one is for the poisson model fitted only to the truncated positive values, and the lower one is the result of the logistic regression with two classes (zero or positive).&lt;br /&gt;
As we did before, we plot the results.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predhp&amp;lt;- predict(modelhp,newdata=test[test$visits!=0,],type = &amp;quot;response&amp;quot;)
plot(test$visits[test$visits!=0],type = &amp;quot;b&amp;quot;,col=&amp;quot;red&amp;quot;)
lines(round(predhp),col=&amp;quot;blue&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/count_data/2020-01-06-count-data-models_files/figure-html/unnamed-chunk-27-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As before, looking at the plot alone does not let us decide which model is best, so we turn to the numeric metrics.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predhp&amp;lt;- predict(modelhp,newdata=test, type = &amp;quot;response&amp;quot;)
rmsemodelhp&amp;lt;-ModelMetrics::rmse(test$visits,round(predhp))
maemodelhp&amp;lt;-mae(test$visits,round(predhp))
knitr::kable(tibble(rmse=rmsemodelhp,mae=
maemodelhp))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;rmse&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mae&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;0.7375374&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2850242&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;hurdle-model-with-negative-binomial-distribution.&#34; class=&#34;section level2&#34; number=&#34;7.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;7.2&lt;/span&gt; hurdle model with negative binomial distribution.&lt;/h2&gt;
&lt;p&gt;Now let’s use the negative binomial instead of the poisson distribution.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modelhnb&amp;lt;-hurdle(visits~.-income, data=train,dist = &amp;quot;negbin&amp;quot;)
summary(modelhnb)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
Call:
hurdle(formula = visits ~ . - income, data = train, dist = &amp;quot;negbin&amp;quot;)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.9078 -0.4515 -0.3201 -0.2022 10.6552 

Count model coefficients (truncated negbin with log link):
             Estimate Std. Error z value Pr(&amp;gt;|z|)    
(Intercept)  -3.68462    2.65037  -1.390   0.1645    
genderfemale  0.07432    0.17299   0.430   0.6675    
age           0.26774    0.54001   0.496   0.6200    
illness1      0.27678    0.34564   0.801   0.4233    
illness2      0.37093    0.35241   1.053   0.2925    
illness3      0.04728    0.39747   0.119   0.9053    
illness4      0.40386    0.40517   0.997   0.3189    
illness5      0.68213    0.41357   1.649   0.0991 .  
reduced       0.15813    0.01935   8.171 3.05e-16 ***
health        0.01891    0.03291   0.575   0.5656    
privateyes   -0.45711    0.23118  -1.977   0.0480 *  
freepooryes   0.03334    0.55282   0.060   0.9519    
freerepatyes -0.59189    0.30437  -1.945   0.0518 .  
nchronicyes   0.08737    0.21061   0.415   0.6783    
lchronicyes   0.30274    0.25846   1.171   0.2415    
Log(theta)   -2.80552    2.80120  -1.002   0.3166    
Zero hurdle model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(&amp;gt;|z|)    
(Intercept)  -3.224621   0.156469 -20.609  &amp;lt; 2e-16 ***
genderfemale  0.305324   0.089648   3.406 0.000660 ***
age           0.700345   0.276089   2.537 0.011191 *  
illness1      0.885148   0.136951   6.463 1.02e-10 ***
illness2      1.238227   0.146059   8.478  &amp;lt; 2e-16 ***
illness3      1.263698   0.169344   7.462 8.50e-14 ***
illness4      1.405167   0.195388   7.192 6.40e-13 ***
illness5      1.445585   0.208425   6.936 4.04e-12 ***
reduced       0.154858   0.013488  11.481  &amp;lt; 2e-16 ***
health        0.070464   0.019142   3.681 0.000232 ***
privateyes    0.271192   0.112751   2.405 0.016163 *  
freepooryes  -0.546177   0.277942  -1.965 0.049406 *  
freerepatyes  0.423220   0.153994   2.748 0.005991 ** 
nchronicyes  -0.006256   0.102033  -0.061 0.951106    
lchronicyes   0.070658   0.140587   0.503 0.615251    
---
Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1 

Theta: count = 0.0605
Number of iterations in BFGS optimization: 31 
Log-likelihood: -2524 on 31 Df&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And let’s plot the predicted against the actual values of the testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predhnb&amp;lt;- predict(modelhnb,newdata=test[test$visits!=0,],type = &amp;quot;response&amp;quot;)
plot(test$visits[test$visits!=0],type = &amp;quot;b&amp;quot;,col=&amp;quot;red&amp;quot;)
lines(round(predhnb),col=&amp;quot;blue&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/count_data/2020-01-06-count-data-models_files/figure-html/unnamed-chunk-30-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;And for the metrics.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predhnb&amp;lt;- predict(modelhnb,newdata=test,type = &amp;quot;response&amp;quot;)
rmsemodelhnb&amp;lt;-ModelMetrics::rmse(test$visits,round(predhnb))
maemodelhnb&amp;lt;-mae(test$visits,round(predhnb))
knitr::kable(tibble(rmse=rmsemodelhnb,mae=
maemodelhnb))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;rmse&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mae&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;0.7408052&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2879227&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;zero-inflated-model&#34; class=&#34;section level1&#34; number=&#34;8&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;8&lt;/span&gt; Zero inflated model&lt;/h1&gt;
&lt;p&gt;Like the previous model type, this model also combines two components, but with the difference that it is a mixture: a binomial component generates the extra (structural) zeros, while a poisson (or negative binomial) component generates the remaining values (zeros included).&lt;/p&gt;
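&lt;p&gt;The mixture structure shows up directly in the predictions: the response mean is the count-component mean scaled by the probability of not being a structural zero. A self-contained sketch on simulated data (names are made up for illustration):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(pscl)
set.seed(1)
n &amp;lt;- 500
x &amp;lt;- rnorm(n)
y &amp;lt;- ifelse(runif(n) &amp;lt; 0.3, 0, rpois(n, exp(0.2 + 0.4 * x))) # 30% structural zeros
zim &amp;lt;- zeroinfl(y ~ x | 1, dist = &amp;quot;poisson&amp;quot;)
mu &amp;lt;- predict(zim, type = &amp;quot;count&amp;quot;) # poisson mean of the count component
p0 &amp;lt;- predict(zim, type = &amp;quot;zero&amp;quot;)  # probability of a structural zero
all.equal(predict(zim, type = &amp;quot;response&amp;quot;), (1 - p0) * mu) # TRUE&lt;/code&gt;&lt;/pre&gt;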
&lt;div id=&#34;zero-inflated-model-with-poisson-distribution&#34; class=&#34;section level2&#34; number=&#34;8.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;8.1&lt;/span&gt; Zero inflated model with poisson distribution&lt;/h2&gt;
&lt;p&gt;Here also we fit two models, one with the poisson and one with the negative binomial distribution.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modelzp&amp;lt;-zeroinfl(visits~.-income, data=train,dist = &amp;quot;poisson&amp;quot;)
summary(modelzp)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
Call:
zeroinfl(formula = visits ~ . - income, data = train, dist = &amp;quot;poisson&amp;quot;)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.6247 -0.4791 -0.3326 -0.1783 12.3448 

Count model coefficients (poisson with log link):
             Estimate Std. Error z value Pr(&amp;gt;|z|)    
(Intercept)  -0.86467    0.23608  -3.663  0.00025 ***
genderfemale  0.03280    0.09078   0.361  0.71789    
age           0.13331    0.26922   0.495  0.62049    
illness1      0.32986    0.21846   1.510  0.13105    
illness2      0.34800    0.22426   1.552  0.12071    
illness3      0.20400    0.24152   0.845  0.39832    
illness4      0.44020    0.24324   1.810  0.07034 .  
illness5      0.72463    0.23632   3.066  0.00217 ** 
reduced       0.09679    0.00809  11.964  &amp;lt; 2e-16 ***
health        0.02269    0.01609   1.410  0.15860    
privateyes   -0.26390    0.12796  -2.062  0.03918 *  
freepooryes   0.04860    0.27675   0.176  0.86059    
freerepatyes -0.51894    0.17070  -3.040  0.00237 ** 
nchronicyes   0.08577    0.11490   0.746  0.45538    
lchronicyes   0.10876    0.12745   0.853  0.39346    

Zero-inflation model coefficients (binomial with logit link):
             Estimate Std. Error z value Pr(&amp;gt;|z|)    
(Intercept)   2.44812    0.34933   7.008 2.42e-12 ***
genderfemale -0.48766    0.19038  -2.562 0.010422 *  
age          -0.88816    0.59030  -1.505 0.132431    
illness1     -0.80833    0.31248  -2.587 0.009685 ** 
illness2     -1.41461    0.35338  -4.003 6.25e-05 ***
illness3     -1.69204    0.44028  -3.843 0.000121 ***
illness4     -1.52224    0.46334  -3.285 0.001019 ** 
illness5     -1.08742    0.46493  -2.339 0.019342 *  
reduced      -0.14462    0.03861  -3.746 0.000180 ***
health       -0.05796    0.04486  -1.292 0.196386    
privateyes   -0.73945    0.22597  -3.272 0.001066 ** 
freepooryes   0.73371    0.41402   1.772 0.076370 .  
freerepatyes -1.75454    0.53938  -3.253 0.001142 ** 
nchronicyes   0.13229    0.22623   0.585 0.558697    
lchronicyes   0.03647    0.30620   0.119 0.905194    
---
Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1 

Number of iterations in BFGS optimization: 42 
Log-likelihood: -2579 on 30 Df&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predzp&amp;lt;- predict(modelzp,newdata=test[test$visits!=0,],type = &amp;quot;response&amp;quot;)
plot(test$visits[test$visits!=0],type = &amp;quot;b&amp;quot;,col=&amp;quot;red&amp;quot;)
lines(round(predzp),col=&amp;quot;blue&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/count_data/2020-01-06-count-data-models_files/figure-html/unnamed-chunk-33-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predzp&amp;lt;- predict(modelzp,newdata=test,type = &amp;quot;response&amp;quot;)
rmsemodelzp&amp;lt;-ModelMetrics::rmse(test$visits,round(predzp))
maemodelzp&amp;lt;-mae(test$visits,round(predzp))
knitr::kable(tibble(rmse=rmsemodelzp,mae=
maemodelzp))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;rmse&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mae&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;0.7485897&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2898551&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;zero-inflated-model-with-negative-binomial-distribution&#34; class=&#34;section level2&#34; number=&#34;8.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;8.2&lt;/span&gt; Zero inflated model with negative binomial distribution&lt;/h2&gt;
&lt;p&gt;This time let’s try the negative binomial distribution. Note that this formula keeps &lt;strong&gt;income&lt;/strong&gt; among the predictors, unlike the previous models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modelznb&amp;lt;-zeroinfl(visits~., data=train,dist = &amp;quot;negbin&amp;quot;)
summary(modelznb)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
Call:
zeroinfl(formula = visits ~ ., data = train, dist = &amp;quot;negbin&amp;quot;)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.0440 -0.4582 -0.3031 -0.1680 14.2061 

Count model coefficients (negbin with log link):
              Estimate Std. Error z value Pr(&amp;gt;|z|)    
(Intercept)  -1.383100   0.241778  -5.721 1.06e-08 ***
genderfemale  0.038985   0.090534   0.431 0.666751    
age           0.028837   0.290843   0.099 0.921019    
income       -0.196592   0.147560  -1.332 0.182767    
illness1      0.613258   0.174652   3.511 0.000446 ***
illness2      0.692297   0.179663   3.853 0.000117 ***
illness3      0.664613   0.196061   3.390 0.000699 ***
illness4      0.760162   0.204137   3.724 0.000196 ***
illness5      0.944756   0.206097   4.584 4.56e-06 ***
reduced       0.102651   0.008776  11.697  &amp;lt; 2e-16 ***
health        0.044012   0.015536   2.833 0.004611 ** 
privateyes   -0.168864   0.138680  -1.218 0.223358    
freepooryes  -0.422748   0.306653  -1.379 0.168022    
freerepatyes -0.383558   0.163995  -2.339 0.019344 *  
nchronicyes   0.033374   0.107881   0.309 0.757048    
lchronicyes   0.065834   0.128987   0.510 0.609775    
Log(theta)    0.473936   0.142626   3.323 0.000891 ***

Zero-inflation model coefficients (binomial with logit link):
               Estimate Std. Error z value Pr(&amp;gt;|z|)    
(Intercept)   2.280e+00  5.322e-01   4.285 1.83e-05 ***
genderfemale -7.269e-01  2.753e-01  -2.640  0.00828 ** 
age          -2.003e+00  9.202e-01  -2.177  0.02951 *  
income       -1.803e-01  3.933e-01  -0.458  0.64669    
illness1     -3.327e-01  3.480e-01  -0.956  0.33894    
illness2     -1.112e+00  4.496e-01  -2.473  0.01339 *  
illness3     -9.533e-01  5.127e-01  -1.859  0.06297 .  
illness4     -1.551e+00  7.398e-01  -2.097  0.03599 *  
illness5     -1.230e+00  8.597e-01  -1.431  0.15257    
reduced      -1.298e+00  4.577e-01  -2.836  0.00456 ** 
health       -1.443e-03  5.509e-02  -0.026  0.97910    
privateyes   -8.179e-01  3.178e-01  -2.574  0.01005 *  
freepooryes   2.394e-01  6.648e-01   0.360  0.71878    
freerepatyes -1.572e+01  1.528e+03  -0.010  0.99179    
nchronicyes   4.502e-02  2.982e-01   0.151  0.88001    
lchronicyes  -1.637e-01  4.951e-01  -0.331  0.74085    
---
Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1 

Theta = 1.6063 
Number of iterations in BFGS optimization: 66 
Log-likelihood: -2512 on 33 Df&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predznb&amp;lt;- predict(modelznb,newdata=test,type = &amp;quot;response&amp;quot;)
rmsemodelznb&amp;lt;-ModelMetrics::rmse(test$visits,round(predznb))
maemodelznb&amp;lt;-mae(test$visits,round(predznb))
knitr::kable(tibble(rmse=rmsemodelznb,mae=maemodelznb))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;rmse&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mae&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;0.7309579&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2753623&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Finally let’s compare all the above models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rmse&amp;lt;-c(rmsemodelp,rmsemodelqp,rmsemodelnb,rmsemodelhp,rmsemodelhnb,
           rmsemodelzp,rmsemodelznb)
mae&amp;lt;-c(maemodelp,maemodelqp,maemodelnb,maemodelhp,maemodelhnb,
           maemodelzp,maemodelznb)
models&amp;lt;-c(&amp;quot;pois&amp;quot;,&amp;quot;q_pois&amp;quot;,&amp;quot;nb&amp;quot;,&amp;quot;h_pois&amp;quot;,&amp;quot;h_nb&amp;quot;,&amp;quot;zer_pois&amp;quot;,&amp;quot;zer_nb&amp;quot;)

data.frame(models,rmse,mae)%&amp;gt;% 
  arrange(rmse)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;    models      rmse       mae
1   zer_nb 0.7309579 0.2753623
2   h_pois 0.7375374 0.2850242
3     pois 0.7381921 0.2840580
4   q_pois 0.7381921 0.2840580
5     h_nb 0.7408052 0.2879227
6 zer_pois 0.7485897 0.2898551
7       nb 0.7808085 0.2966184&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both metrics pick the zero-inflated negative binomial model as the best one, with the minimum rmse (&lt;strong&gt;0.7309579&lt;/strong&gt;) and the minimum mae (&lt;strong&gt;0.2753623&lt;/strong&gt;). This result is in line with the fact that this kind of model handles the excess of zeros and the overdispersion problem at the same time.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;9&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;9&lt;/span&gt; Conclusion:&lt;/h1&gt;
&lt;p&gt;If the data truly follow a poisson distribution, then all the other models have extra parameters that, during training, converge towards the optimal poisson parameter values; this relation is analogous to that between linear regression and generalized least squares. However, if the data are heavily skewed towards zero, it is better to use the last two model families, which explicitly take care of this issue.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;furhter-reading&#34; class=&#34;section level1&#34; number=&#34;10&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;10&lt;/span&gt; Further reading:&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Michael J. Crawley, The R book, WILEY, UK, 2013. &lt;a href=&#34;http://www.bio.ic.ac.uk/research/mjcraw/therbook/index.htm&#34; class=&#34;uri&#34;&gt;http://www.bio.ic.ac.uk/research/mjcraw/therbook/index.htm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;session-info&#34; class=&#34;section level1&#34; number=&#34;11&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;11&lt;/span&gt; Session info&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pscl_1.5.5           broom_0.7.1          AER_1.2-9           
 [4] survival_3.2-7       sandwich_3.0-0       lmtest_0.9-38       
 [7] zoo_1.8-8            car_3.0-10           carData_3.0-4       
[10] forcats_0.5.0        stringr_1.4.0        dplyr_1.0.2         
[13] readr_1.3.1          tidyr_1.1.2          tibble_3.0.3        
[16] ggplot2_3.3.2        tidyverse_1.3.0      MASS_7.3-53         
[19] purrr_0.3.4          corrr_0.4.2          ModelMetrics_1.2.2.2
[22] performance_0.5.0   

loaded via a namespace (and not attached):
 [1] httr_1.4.2        jsonlite_1.7.1    splines_4.0.1     modelr_0.1.8     
 [5] Formula_1.2-3     assertthat_0.2.1  highr_0.8         blob_1.2.1       
 [9] cellranger_1.1.0  yaml_2.2.1        bayestestR_0.7.2  pillar_1.4.6     
[13] backports_1.1.10  lattice_0.20-41   glue_1.4.2        digest_0.6.25    
[17] rvest_0.3.6       colorspace_1.4-1  htmltools_0.5.0   Matrix_1.2-18    
[21] pkgconfig_2.0.3   haven_2.3.1       bookdown_0.20     scales_1.1.1     
[25] openxlsx_4.2.2    rio_0.5.16        farver_2.0.3      generics_0.0.2   
[29] ellipsis_0.3.1    withr_2.3.0       cli_2.0.2         magrittr_1.5     
[33] crayon_1.3.4      readxl_1.3.1      evaluate_0.14     fs_1.5.0         
[37] fansi_0.4.1       xml2_1.3.2        foreign_0.8-80    blogdown_0.20    
[41] tools_4.0.1       data.table_1.13.0 hms_0.5.3         lifecycle_0.2.0  
[45] munsell_0.5.0     reprex_0.3.0      zip_2.1.1         compiler_4.0.1   
[49] rlang_0.4.7       grid_4.0.1        rstudioapi_0.11   labeling_0.3     
[53] rmarkdown_2.4     gtable_0.3.0      abind_1.4-5       DBI_1.1.0        
[57] curl_4.3          R6_2.4.1          lubridate_1.7.9   knitr_1.30       
[61] utf8_1.1.4        insight_0.9.6     stringi_1.5.3     Rcpp_1.0.5       
[65] vctrs_0.3.4       dbplyr_1.4.4      tidyselect_1.1.0  xfun_0.18        &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Xgboost model</title>
      <link>https://modelingwithr.rbind.io/post/xgboost/xgboost/</link>
      <pubDate>Sun, 05 Jan 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/post/xgboost/xgboost/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-visualization&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Data visualization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-partition&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Data partition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-training&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Model training&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#fine-tune-the-hyperparameters&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Fine tune the hyperparameters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Conclusion:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-information&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8&lt;/span&gt; Session information&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;A decision tree&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; is a model that recursively splits the input space into regions and defines a local model for each resulting region. However, fitting a single decision tree to complex data rarely yields accurate predictions, which is why it is termed a 
&lt;a href=&#34;http://rob.schapire.net/papers/strengthofweak.pdf&#34;&gt;weak learner&lt;/a&gt;. Combining multiple decision trees (also called &lt;strong&gt;ensemble models&lt;/strong&gt;) using techniques such as aggregating and boosting can largely improve the model accuracy. 
&lt;a href=&#34;https://xgboost.readthedocs.io/en/latest/R-package/index.html&#34;&gt;Xgboost&lt;/a&gt; (short for Extreme gradient boosting) is a tree-based algorithm that uses these techniques. It can be used for both &lt;strong&gt;classification&lt;/strong&gt; and &lt;strong&gt;regression&lt;/strong&gt;.
In this paper we learn how to apply this model to the well-known Titanic data, as we did in previous papers with different kinds of models.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;First we load the required packages and the Titanic data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(tidyverse))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;ggplot2&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;tibble&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;tidyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;dplyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(caret))
data &amp;lt;- read_csv(&amp;quot;../train.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Parsed with column specification:
## cols(
##   PassengerId = col_double(),
##   Survived = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Ticket = col_character(),
##   Fare = col_double(),
##   Cabin = col_character(),
##   Embarked = col_character()
## )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s take a look at this data using the &lt;strong&gt;dplyr&lt;/strong&gt; function &lt;strong&gt;glimpse&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;glimpse(data)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 891
## Columns: 12
## $ PassengerId &amp;lt;dbl&amp;gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Survived    &amp;lt;dbl&amp;gt; 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
## $ Pclass      &amp;lt;dbl&amp;gt; 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
## $ Name        &amp;lt;chr&amp;gt; &amp;quot;Braund, Mr. Owen Harris&amp;quot;, &amp;quot;Cumings, Mrs. John Bradley ...
## $ Sex         &amp;lt;chr&amp;gt; &amp;quot;male&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;male&amp;quot;, &amp;quot;male&amp;quot;, &amp;quot;...
## $ Age         &amp;lt;dbl&amp;gt; 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
## $ SibSp       &amp;lt;dbl&amp;gt; 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
## $ Parch       &amp;lt;dbl&amp;gt; 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
## $ Ticket      &amp;lt;chr&amp;gt; &amp;quot;A/5 21171&amp;quot;, &amp;quot;PC 17599&amp;quot;, &amp;quot;STON/O2. 3101282&amp;quot;, &amp;quot;113803&amp;quot;, ...
## $ Fare        &amp;lt;dbl&amp;gt; 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
## $ Cabin       &amp;lt;chr&amp;gt; NA, &amp;quot;C85&amp;quot;, NA, &amp;quot;C123&amp;quot;, NA, NA, &amp;quot;E46&amp;quot;, NA, NA, NA, &amp;quot;G6&amp;quot;,...
## $ Embarked    &amp;lt;chr&amp;gt; &amp;quot;S&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;Q&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;S&amp;quot;, ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For prediction purposes some variables should be removed, such as PassengerId, Name, Ticket, and Cabin, while some others should be converted to a more suitable type. The following script performs these transformations; for more detail you can refer to my previous paper on logistic regression.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-data[,-c(1,4,9,11)]
mydata$Survived&amp;lt;-as.integer(mydata$Survived)
mydata&amp;lt;-modify_at(mydata,c(&amp;quot;Pclass&amp;quot;,&amp;quot;Sex&amp;quot;,&amp;quot;Embarked&amp;quot;,&amp;quot;SibSp&amp;quot;,&amp;quot;Parch&amp;quot;), as.factor)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s check the summary of the transformed data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     Survived      Pclass      Sex           Age        SibSp   Parch  
##  Min.   :0.0000   1:216   female:314   Min.   : 0.42   0:608   0:678  
##  1st Qu.:0.0000   2:184   male  :577   1st Qu.:20.12   1:209   1:118  
##  Median :0.0000   3:491                Median :28.00   2: 28   2: 80  
##  Mean   :0.3838                        Mean   :29.70   3: 16   3:  5  
##  3rd Qu.:1.0000                        3rd Qu.:38.00   4: 18   4:  4  
##  Max.   :1.0000                        Max.   :80.00   5:  5   5:  5  
##                                        NA&amp;#39;s   :177     8:  7   6:  1  
##       Fare        Embarked  
##  Min.   :  0.00   C   :168  
##  1st Qu.:  7.91   Q   : 77  
##  Median : 14.45   S   :644  
##  Mean   : 32.20   NA&amp;#39;s:  2  
##  3rd Qu.: 31.00             
##  Max.   :512.33             
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, we have 177 missing values in the Age variable and 2 in Embarked. For missing values we have two strategies: removing the incomplete rows from the analysis, at the cost of losing a lot of data, or imputing them with one of the available imputation methods. Since we have a large number of missing values compared to the total number of examples, it is better to follow the latter strategy. Thankfully, the 
&lt;a href=&#34;https://cran.r-project.org/web/packages/mice/mice.pdf&#34;&gt;mice&lt;/a&gt; package is very powerful for this purpose and provides many imputation methods for all variable types.
We will opt for the random forest method since in most cases it is the best choice. However, in order to respect the most important rule in machine learning, never touch the test data during the training process, we will apply this imputation after splitting the data.&lt;/p&gt;
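&lt;p&gt;For comparison, a much simpler baseline than mice is single-value imputation; a minimal sketch on a toy vector (not the Titanic data) would be:&lt;/p&gt;

```r
# Toy age vector with missing values, for illustration only
age = c(22, 38, NA, 35, NA, 54)

# Replace each NA by the median of the observed values
age[is.na(age)] = median(age, na.rm = TRUE)
age  # 22.0 38.0 36.5 35.0 36.5 54.0
```

&lt;p&gt;This baseline discards the between-imputation variability that mice preserves, which is one reason the random forest method is preferred here.&lt;/p&gt;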
&lt;/div&gt;
&lt;div id=&#34;data-visualization&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Data visualization&lt;/h1&gt;
&lt;p&gt;Beyond modeling, we have other tools, such as visualization, for investigating relationships between variables. Here we visualize the relationship between each predictor and the target variable using the ggplot2 package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)
ggplot(mydata,aes(Sex,Survived,color=Sex))+
  geom_point()+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/xgboost/xgboost_files/figure-html/unnamed-chunk-6-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The left side of the plot shows that a higher fraction of females survived, whereas the right side shows the reverse for males, most of whom died. We can infer from this plot that, ceteris paribus, this predictor is likely to be relevant for prediction.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(Pclass,Survived,color=Pclass))+
  geom_point()+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/xgboost/xgboost_files/figure-html/unnamed-chunk-7-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In this plot most of the first class passengers survived, in contrast with the third class passengers, most of whom died. The second class, however, seems roughly balanced. Again, this predictor can also be relevant.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(SibSp,Survived,color=SibSp))+
  geom_point()+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/xgboost/xgboost_files/figure-html/unnamed-chunk-8-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This predictor refers to the number of siblings a passenger has. It seems to be equally distributed given the target variable, and hence may be largely irrelevant. In other words, knowing the number of siblings of a particular passenger does not help predict whether this passenger survived or died.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(Parch,Survived,color=Parch))+
  geom_point()+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/xgboost/xgboost_files/figure-html/unnamed-chunk-9-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This predictor refers to the number of parents and children a passenger has. It seems that this predictor is slightly discriminative if we look closely at the level 0, passengers with no parents or children.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(Embarked,Survived,color=Embarked))+
  geom_point()+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/xgboost/xgboost_files/figure-html/unnamed-chunk-10-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We see that a passenger who embarked from port &lt;strong&gt;S&lt;/strong&gt; is slightly more likely to have died, while the other ports seem to be equally distributed.&lt;/p&gt;
&lt;p&gt;For numeric variables we use the empirical density given the target variable as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata[complete.cases(mydata),], aes(Age,fill=as.factor(Survived)))+
  geom_density(alpha=.5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/xgboost/xgboost_files/figure-html/unnamed-chunk-11-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We see significant overlap between the two conditional distributions, which may indicate that this variable is less relevant.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata, aes(Fare,fill=as.factor(Survived)))+
  geom_density(alpha=.5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/xgboost/xgboost_files/figure-html/unnamed-chunk-12-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;For this variable the conditional distributions are different; we see a spike close to zero reflecting the higher death rate among third class passengers.&lt;/p&gt;
&lt;p&gt;We can also plot two predictors against each other. For instance, let’s try the two predictors Sex and Pclass:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(Sex,Pclass,color=as.factor(Survived)))+
  geom_point(col=&amp;quot;green&amp;quot;,pch=16,cex=7)+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/xgboost/xgboost_files/figure-html/unnamed-chunk-13-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The majority of the females who survived (blue points on the left) came from the first and the second class, while the majority of the males who died (red points on the right) came from the third class.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-partition&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Data partition&lt;/h1&gt;
&lt;p&gt;We take 80% of the data as the training set, and the remaining 20% will serve as the testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
index&amp;lt;-createDataPartition(mydata$Survived,p=0.8,list=FALSE)
train&amp;lt;-mydata[index,]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: The `i` argument of ``[`()` can&amp;#39;t be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test&amp;lt;-mydata[-index,]&lt;/code&gt;&lt;/pre&gt;
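&lt;p&gt;For readers without caret, roughly the same 80/20 split can be sketched with base R; note that createDataPartition also balances the outcome between the two sets, which plain sample() does not:&lt;/p&gt;

```r
# A base-R approximation of the 80/20 split
set.seed(1234)
n   = 891                        # rows in the Titanic training file
idx = sample(n, size = round(0.8 * n))
length(idx)                      # 713 training rows
n - length(idx)                  # 178 testing rows
```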
&lt;p&gt;Now we are ready to impute the missing values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(mice))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;mice&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;imput_train&amp;lt;-mice(train,m=3,seed=111, method = &amp;#39;rf&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: Number of logged events: 30&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train2&amp;lt;-complete(imput_train,1)
summary(train2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From this output we see that we do not have missing values any more.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;model-training&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Model training&lt;/h1&gt;
&lt;p&gt;The xgboost model expects the predictors to be numeric, so we convert the factors to dummy variables with the help of the &lt;strong&gt;Matrix&lt;/strong&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(Matrix))
train_data&amp;lt;-sparse.model.matrix(Survived ~. -1, data=train2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that the -1 added to the formula avoids adding an intercept column of ones to our data. We can take a look at the structure of the data as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;str(train_data)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Formal class &amp;#39;dgCMatrix&amp;#39; [package &amp;quot;Matrix&amp;quot;] with 6 slots
##   ..@ i       : int [1:3570] 1 3 5 8 17 20 23 24 27 28 ...
##   ..@ p       : int [1:21] 0 178 329 713 1173 1886 2062 2086 2100 2114 ...
##   ..@ Dim     : int [1:2] 713 20
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:713] &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; &amp;quot;4&amp;quot; ...
##   .. ..$ : chr [1:20] &amp;quot;Pclass1&amp;quot; &amp;quot;Pclass2&amp;quot; &amp;quot;Pclass3&amp;quot; &amp;quot;Sexmale&amp;quot; ...
##   ..@ x       : num [1:3570] 1 1 1 1 1 1 1 1 1 1 ...
##   ..@ factors : list()&lt;/code&gt;&lt;/pre&gt;
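&lt;p&gt;The effect of the -1 in the formula can be checked on a tiny, made-up data frame with base model.matrix, which follows the same coding rules as sparse.model.matrix:&lt;/p&gt;

```r
# Made-up data frame, for illustration only
df = data.frame(y = c(0, 1, 1), sex = factor(c("male", "female", "female")))

# With an intercept, one factor level is dropped as the reference level;
# with -1, every level of the first factor gets its own dummy column
colnames(model.matrix(y ~ ., data = df))      # "(Intercept)" "sexmale"
colnames(model.matrix(y ~ . - 1, data = df))  # "sexfemale" "sexmale"
```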
&lt;p&gt;We know that many machine learning algorithms require the inputs to be of a specific type. The input types supported by the xgboost algorithm are: matrix, the &lt;strong&gt;dgCMatrix&lt;/strong&gt; class produced by the &lt;strong&gt;Matrix&lt;/strong&gt; package above, or the xgboost class &lt;strong&gt;xgb.DMatrix&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(xgboost))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;xgboost&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We should first store the dependent variable in a separate vector, let’s call it &lt;strong&gt;train_label&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_label&amp;lt;-train$Survived
dim(train_data)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 713  20&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;length(train$Survived)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 713&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we bind the predictors, contained in train_data, with the train_label vector into an &lt;strong&gt;xgb.DMatrix&lt;/strong&gt; object as follows&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_final&amp;lt;-xgb.DMatrix(data = train_data,label=train_label)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To train the model we provide the inputs and specify the argument values whenever we do not want to keep the following defaults:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;objective: for binary classification we use &lt;strong&gt;binary:logistic&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;eta (default=0.3): The learning rate.&lt;/li&gt;
&lt;li&gt;gamma (default=0): also called min_split_loss, the minimum loss required for splitting further a particular node.&lt;/li&gt;
&lt;li&gt;max_depth(default=6): the maximum depth of the tree.&lt;/li&gt;
&lt;li&gt;min_child_weight(default=1): the minimum number of instances required in a node under which the node will be leaf.&lt;/li&gt;
&lt;li&gt;subsample (default=1): with the default the model uses all the data for each tree; if 0.7, for instance, the model randomly samples 70% of the data at each iteration, which helps fight the overfitting problem.&lt;/li&gt;
&lt;li&gt;colsample_bytree (default=1, select all columns): subsample ratio of columns at each iteration.&lt;/li&gt;
&lt;li&gt;nthreads (default=2): number of CPU threads used in parallel processing.&lt;/li&gt;
&lt;li&gt;nrounds : the number of boosting iterations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can check the whole parameters by typing &lt;strong&gt;?xgboost&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;It should be noted that the input data can be fed into the model in two ways:
if the data is of class &lt;strong&gt;xgb.DMatrix&lt;/strong&gt;, containing both the predictors and the label as we did, then we do not use the &lt;strong&gt;label&lt;/strong&gt; argument. Otherwise, with any other class, we provide both the data and label arguments.&lt;/p&gt;
&lt;p&gt;Our first attempt will be made with 40 iterations and the default values for the other arguments.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mymodel &amp;lt;- xgboost(data=train_final, objective = &amp;quot;binary:logistic&amp;quot;,
                   nrounds = 40)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1]  train-error:0.148668 
## [2]  train-error:0.133240 
## [3]  train-error:0.130435 
## [4]  train-error:0.137447 
## [5]  train-error:0.127630 
## [6]  train-error:0.117812 
## [7]  train-error:0.115007 
## [8]  train-error:0.109397 
## [9]  train-error:0.102384 
## [10] train-error:0.103787 
## [11] train-error:0.103787 
## [12] train-error:0.102384 
## [13] train-error:0.100982 
## [14] train-error:0.098177 
## [15] train-error:0.098177 
## [16] train-error:0.096774 
## [17] train-error:0.096774 
## [18] train-error:0.098177 
## [19] train-error:0.093969 
## [20] train-error:0.091164 
## [21] train-error:0.086957 
## [22] train-error:0.085554 
## [23] train-error:0.085554 
## [24] train-error:0.082749 
## [25] train-error:0.082749 
## [26] train-error:0.082749 
## [27] train-error:0.079944 
## [28] train-error:0.075736 
## [29] train-error:0.074334 
## [30] train-error:0.074334 
## [31] train-error:0.072931 
## [32] train-error:0.072931 
## [33] train-error:0.070126 
## [34] train-error:0.070126 
## [35] train-error:0.070126 
## [36] train-error:0.068724 
## [37] train-error:0.067321 
## [38] train-error:0.061711 
## [39] train-error:0.061711 
## [40] train-error:0.063114&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can plot the error rates as follows&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt; mymodel$evaluation_log %&amp;gt;%   
  ggplot(aes(iter, train_error))+
  geom_point()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/xgboost/xgboost_files/figure-html/unnamed-chunk-22-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;To evaluate the model we use the test data, which should go through the same steps as the training data, except for the missing values: since the test set is only used to evaluate the model, we simply remove the rows with missing values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test1 &amp;lt;- test[complete.cases(test),]
test2&amp;lt;-sparse.model.matrix(Survived ~. -1,data=test1)
test_label&amp;lt;-test1$Survived
test_final&amp;lt;-xgb.DMatrix(data = test2, label=test_label)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we use the predict function and the confusionMatrix function from the caret package. Since the predicted values are probabilities, we convert them to predicted classes using a threshold of 0.5 as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(mymodel, test_final)
pred&amp;lt;-ifelse(pred&amp;gt;.5,1,0)
confusionMatrix(as.factor(pred),as.factor(test_label))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 81 13
##          1 11 36
##                                           
##                Accuracy : 0.8298          
##                  95% CI : (0.7574, 0.8878)
##     No Information Rate : 0.6525          
##     P-Value [Acc &amp;gt; NIR] : 2.379e-06       
##                                           
##                   Kappa : 0.6211          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.8383          
##                                           
##             Sensitivity : 0.8804          
##             Specificity : 0.7347          
##          Pos Pred Value : 0.8617          
##          Neg Pred Value : 0.7660          
##              Prevalence : 0.6525          
##          Detection Rate : 0.5745          
##    Detection Prevalence : 0.6667          
##       Balanced Accuracy : 0.8076          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
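&lt;p&gt;The thresholding step itself needs nothing beyond base R; on a hypothetical vector of probabilities, table() gives the raw confusion matrix that confusionMatrix() summarises:&lt;/p&gt;

```r
# Hypothetical predicted probabilities and true labels
prob  = c(0.9, 0.2, 0.6, 0.4, 0.8)
truth = c(1, 0, 1, 1, 1)

pred = ifelse(prob > 0.5, 1, 0)          # apply the 0.5 threshold
table(Predicted = pred, Actual = truth)  # raw confusion matrix
mean(pred == truth)                      # accuracy: 0.8
```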
&lt;p&gt;With the default values we obtain a pretty good accuracy rate. In the next step we fine-tune the hyperparameters using &lt;strong&gt;cross validation&lt;/strong&gt; with the help of the caret package.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;fine-tune-the-hyperparameters&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Fine tune the hyperparameters&lt;/h1&gt;
&lt;p&gt;For the hyperparameters we try different grid values for the above arguments as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;eta: seq(0.2,1,0.2)&lt;/li&gt;
&lt;li&gt;max_depth: seq(2,6,1)&lt;/li&gt;
&lt;li&gt;min_child_weight: c(1,5,10)&lt;/li&gt;
&lt;li&gt;colsample_bytree : seq(0.6,1,0.1)&lt;/li&gt;
&lt;li&gt;nrounds : c(50,200 ,50)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This requires training the model 375 times.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid_tune &amp;lt;- expand.grid(
  nrounds = c(50,200,50),
  max_depth = seq(2,6,1),
  eta = seq(0.2,1,0.2),
  gamma = 0,
  min_child_weight = 1,
  colsample_bytree = seq(0.6,1,0.1),
  subsample = 1
  )&lt;/code&gt;&lt;/pre&gt;
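&lt;p&gt;As a quick sanity check on the figure of 375 models, we can count the rows of this grid; the definition is repeated here so the snippet is self-contained (note that nrounds has three entries, so 3 x 5 x 5 x 5 = 375):&lt;/p&gt;

```r
# Same grid as above; expand.grid crosses every combination of values
grid_tune = expand.grid(
  nrounds          = c(50, 200, 50),
  max_depth        = seq(2, 6, 1),
  eta              = seq(0.2, 1, 0.2),
  gamma            = 0,
  min_child_weight = 1,
  colsample_bytree = seq(0.6, 1, 0.1),
  subsample        = 1
)
nrow(grid_tune)  # 375
```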
&lt;p&gt;Then we use 5-fold cross validation as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;control &amp;lt;- trainControl(
  method = &amp;quot;repeatedcv&amp;quot;,
  number = 5,
  allowParallel = TRUE
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we use the &lt;strong&gt;train&lt;/strong&gt; function from caret to train the model, specifying the method as &lt;strong&gt;xgbTree&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_data1 &amp;lt;- as.matrix(train_data)
train_label1 &amp;lt;- as.factor(train_label)
#mymodel2 &amp;lt;- train(
#  x = train_data1,
#  y = train_label1,
#  trControl = control,
#  tuneGrid = grid_tune,
#  method = &amp;quot;xgbTree&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Training this model took several minutes, and we do not want it to be rerun every time this document is rendered. That is why I have commented out the above script, saved the results in a csv file, and then reloaded them to continue our analysis. If you would like to run this model yourself, just uncomment the script.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# results &amp;lt;- mymodel2$results
# write_csv(results, &amp;quot;xgb_results.csv&amp;quot;)
results &amp;lt;- read_csv(&amp;quot;xgb_results.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Parsed with column specification:
## cols(
##   eta = col_double(),
##   max_depth = col_double(),
##   gamma = col_double(),
##   colsample_bytree = col_double(),
##   min_child_weight = col_double(),
##   subsample = col_double(),
##   nrounds = col_double(),
##   Accuracy = col_double(),
##   Kappa = col_double(),
##   AccuracySD = col_double(),
##   KappaSD = col_double()
## )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s now check the best hyperparameter values:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results %&amp;gt;% 
  arrange(-Accuracy) %&amp;gt;% 
  head(5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 11
##     eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
##   &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;            &amp;lt;dbl&amp;gt;            &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
## 1   0.2         4     0              0.6                1         1      50
## 2   0.2         6     0              0.6                1         1      50
## 3   0.8         2     0              0.8                1         1      50
## 4   0.4         3     0              0.6                1         1      50
## 5   0.2         3     0              1                  1         1     200
## # ... with 4 more variables: Accuracy &amp;lt;dbl&amp;gt;, Kappa &amp;lt;dbl&amp;gt;, AccuracySD &amp;lt;dbl&amp;gt;,
## #   KappaSD &amp;lt;dbl&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, the highest accuracy rate is about 81.34%, with the corresponding hyperparameter values as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results %&amp;gt;% 
  arrange(-Accuracy) %&amp;gt;% 
  head(1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 x 11
##     eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
##   &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;            &amp;lt;dbl&amp;gt;            &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
## 1   0.2         4     0              0.6                1         1      50
## # ... with 4 more variables: Accuracy &amp;lt;dbl&amp;gt;, Kappa &amp;lt;dbl&amp;gt;, AccuracySD &amp;lt;dbl&amp;gt;,
## #   KappaSD &amp;lt;dbl&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we apply these values to fit the final model on the whole data set loaded at the beginning from the train.csv file; then we load the test.csv file of the Titanic data in order to submit our predictions to the kaggle competition.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;imput_mydata&amp;lt;-mice(mydata,m=3,seed=111, method = &amp;#39;rf&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: Number of logged events: 15&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata_imp&amp;lt;-complete(imput_mydata,1)
my_data&amp;lt;-sparse.model.matrix(Survived ~. -1, data = mydata_imp)
mydata_label&amp;lt;-mydata$Survived
data_final&amp;lt;-xgb.DMatrix(data = my_data,label=mydata_label)
final_model &amp;lt;- xgboost(data=data_final, objective = &amp;quot;binary:logistic&amp;quot;,
                   nrounds = 50, max_depth = 4, eta = 0.2, gamma = 0,
                   colsample_bytree = 0.6, min_child_weight = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and we get the following result:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(final_model, data_final)
pred&amp;lt;-ifelse(pred&amp;gt;.5,1,0)
confusionMatrix(as.factor(pred),as.factor(mydata_label))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 518  60
##          1  31 282
##                                          
##                Accuracy : 0.8979         
##                  95% CI : (0.8761, 0.917)
##     No Information Rate : 0.6162         
##     P-Value [Acc &amp;gt; NIR] : &amp;lt; 2.2e-16      
##                                          
##                   Kappa : 0.7806         
##                                          
##  Mcnemar&amp;#39;s Test P-Value : 0.003333       
##                                          
##             Sensitivity : 0.9435         
##             Specificity : 0.8246         
##          Pos Pred Value : 0.8962         
##          Neg Pred Value : 0.9010         
##              Prevalence : 0.6162         
##          Detection Rate : 0.5814         
##    Detection Prevalence : 0.6487         
##       Balanced Accuracy : 0.8840         
##                                          
##        &amp;#39;Positive&amp;#39; Class : 0              
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The accuracy rate with these values is about 90%, though note that this is measured on the training data, so it is likely optimistic.
Now let’s fit this model to the test.csv file.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kag&amp;lt;-read_csv(&amp;quot;../test.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Parsed with column specification:
## cols(
##   PassengerId = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Ticket = col_character(),
##   Fare = col_double(),
##   Cabin = col_character(),
##   Embarked = col_character()
## )&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kag1&amp;lt;-kag[,-c(3,8,10)]
kag1 &amp;lt;- modify_at(kag1,c(&amp;quot;Pclass&amp;quot;, &amp;quot;Sex&amp;quot;, &amp;quot;Embarked&amp;quot;, &amp;quot;SibSp&amp;quot;, &amp;quot;Parch&amp;quot;), as.factor)
summary(kag1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   PassengerId     Pclass      Sex           Age        SibSp       Parch    
##  Min.   : 892.0   1:107   female:152   Min.   : 0.17   0:283   0      :324  
##  1st Qu.: 996.2   2: 93   male  :266   1st Qu.:21.00   1:110   1      : 52  
##  Median :1100.5   3:218                Median :27.00   2: 14   2      : 33  
##  Mean   :1100.5                        Mean   :30.27   3:  4   3      :  3  
##  3rd Qu.:1204.8                        3rd Qu.:39.00   4:  4   4      :  2  
##  Max.   :1309.0                        Max.   :76.00   5:  1   9      :  2  
##                                        NA&amp;#39;s   :86      8:  2   (Other):  2  
##       Fare         Embarked
##  Min.   :  0.000   C:102   
##  1st Qu.:  7.896   Q: 46   
##  Median : 14.454   S:270   
##  Mean   : 35.627           
##  3rd Qu.: 31.500           
##  Max.   :512.329           
##  NA&amp;#39;s   :1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We have 86 missing values for Age and one for Fare. We borrow a good idea from a kaggler named &lt;strong&gt;Harrison Tietze&lt;/strong&gt;, who suggested treating passengers with missing values as likely to have died. For instance, he replaced the missing ages with the mean age of the passengers who died in the training data. We go even further and consider all rows with missing values as passengers who died.&lt;br /&gt;
Additionally, inspecting the summary above, we notice an extra level (9) in the factor &lt;strong&gt;Parch&lt;/strong&gt; that does not exist in the training data, so the model cannot handle this extra information. However, since this level has only two cases, we can approximate it by the closest existing level, which is 6, and then drop level 9 from the factor.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kag1$Parch[kag1$Parch==9]&amp;lt;-6
kag1$Parch &amp;lt;- kag1$Parch %&amp;gt;% forcats::fct_drop()
kag_died &amp;lt;- kag1[!complete.cases(kag1),]
kag2 &amp;lt;- kag1[complete.cases(kag1),]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So we only use the kag2 data for the prediction.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;DP&amp;lt;-sparse.model.matrix(PassengerId~.-1,data=kag2)
head(DP)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 6 x 20 sparse Matrix of class &amp;quot;dgCMatrix&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    [[ suppressing 20 column names &amp;#39;Pclass1&amp;#39;, &amp;#39;Pclass2&amp;#39;, &amp;#39;Pclass3&amp;#39; ... ]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                                                   
## 1 . . 1 1 34.5 . . . . . . . . . . . .  7.8292 1 .
## 2 . . 1 . 47.0 1 . . . . . . . . . . .  7.0000 . 1
## 3 . 1 . 1 62.0 . . . . . . . . . . . .  9.6875 1 .
## 4 . . 1 1 27.0 . . . . . . . . . . . .  8.6625 . 1
## 5 . . 1 . 22.0 1 . . . . . 1 . . . . . 12.2875 . 1
## 6 . . 1 1 14.0 . . . . . . . . . . . .  9.2250 . 1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predkag&amp;lt;-predict(final_model,DP)
head(predkag)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.10634395 0.17170778 0.09650294 0.12390183 0.60250586 0.11714594&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, the output is the predicted probability for each instance, so we should convert these probabilities to class labels:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predkag&amp;lt;-ifelse(predkag&amp;gt;.5,1,0)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First we cbind the PassengerId column with the predicted values, named Survived; then we rbind the result with the incomplete cases stored in kag_died (labeled as not survived):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predkag2K&amp;lt;-cbind(kag2[,1],Survived=predkag)
kag_died$Survived &amp;lt;- 0
predtestk &amp;lt;- rbind(predkag2K,kag_died[, c(1,9)])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, we save the predictions as a csv file to submit to kaggle and then check our rank:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;write_csv(predtestk,&amp;quot;predxgbkag.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Conclusion:&lt;/h1&gt;
&lt;p&gt;Xgboost is among the most powerful machine learning algorithms available nowadays, thanks to its capability to predict a wide range of data from various domains. Several winning solutions in &lt;strong&gt;kaggle&lt;/strong&gt; competitions and elsewhere have been achieved with this model, and it can handle large and complex data with ease. Its large number of hyperparameters gives the modeler many possibilities to tune the model to the data at hand, as well as to fight other problems such as overfitting, and to perform feature selection.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;session-information&#34; class=&#34;section level1&#34; number=&#34;8&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;8&lt;/span&gt; Session information&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## R version 4.0.1 (2020-06-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] xgboost_1.2.0.1 Matrix_1.2-18   mice_3.11.0     caret_6.0-86   
##  [5] lattice_0.20-41 forcats_0.5.0   stringr_1.4.0   dplyr_1.0.2    
##  [9] purrr_0.3.4     readr_1.3.1     tidyr_1.1.2     tibble_3.0.3   
## [13] ggplot2_3.3.2   tidyverse_1.3.0
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-149         fs_1.5.0             lubridate_1.7.9     
##  [4] httr_1.4.2           tools_4.0.1          backports_1.1.10    
##  [7] utf8_1.1.4           R6_2.4.1             rpart_4.1-15        
## [10] DBI_1.1.0            colorspace_1.4-1     nnet_7.3-14         
## [13] withr_2.3.0          tidyselect_1.1.0     compiler_4.0.1      
## [16] cli_2.0.2            rvest_0.3.6          xml2_1.3.2          
## [19] labeling_0.3         bookdown_0.20        scales_1.1.1        
## [22] randomForest_4.6-14  digest_0.6.25        rmarkdown_2.4       
## [25] pkgconfig_2.0.3      htmltools_0.5.0      dbplyr_1.4.4        
## [28] rlang_0.4.7          readxl_1.3.1         rstudioapi_0.11     
## [31] generics_0.0.2       farver_2.0.3         jsonlite_1.7.1      
## [34] ModelMetrics_1.2.2.2 magrittr_1.5         Rcpp_1.0.5          
## [37] munsell_0.5.0        fansi_0.4.1          lifecycle_0.2.0     
## [40] stringi_1.5.3        pROC_1.16.2          yaml_2.2.1          
## [43] MASS_7.3-53          plyr_1.8.6           recipes_0.1.13      
## [46] grid_4.0.1           blob_1.2.1           crayon_1.3.4        
## [49] haven_2.3.1          splines_4.0.1        hms_0.5.3           
## [52] knitr_1.30           pillar_1.4.6         reshape2_1.4.4      
## [55] codetools_0.2-16     stats4_4.0.1         reprex_0.3.0        
## [58] glue_1.4.2           evaluate_0.14        blogdown_0.20       
## [61] data.table_1.13.0    modelr_0.1.8         vctrs_0.3.4         
## [64] foreach_1.5.0        cellranger_1.1.0     gtable_0.3.0        
## [67] assertthat_0.2.1     xfun_0.18            gower_0.2.2         
## [70] prodlim_2019.11.13   broom_0.7.1          e1071_1.7-3         
## [73] class_7.3-17         survival_3.2-7       timeDate_3043.102   
## [76] iterators_1.0.12     lava_1.6.8           ellipsis_0.3.1      
## [79] ipred_0.9-9&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;Kevin P.Murphy 2012&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>logistic regression</title>
      <link>https://modelingwithr.rbind.io/post/logimodel/logimodel/</link>
      <pubDate>Thu, 19 Dec 2019 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/post/logimodel/logimodel/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-partition&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Data partition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#train-the-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Train the model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#prediction-and-confusion-matrix&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Prediction and confusion matrix&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-link-function&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; The link function&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;In this paper we will fit a logistic regression model to the &lt;strong&gt;heart disease&lt;/strong&gt; data &lt;a href=&#34;https://www.kaggle.com/johnsmith88/heart-disease-dataset&#34;&gt;uploaded from kaggle website&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For the data preparation we will follow the same steps as in my previous paper about the &lt;strong&gt;naive bayes model&lt;/strong&gt;; for more details, click 
&lt;a href=&#34;https://github.com/Metalesaek/naive-bayes-model&#34;&gt;here&lt;/a&gt; to access that paper.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;First we load the required packages and read in the data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse, warn.conflicts = FALSE)
library(caret, warn.conflicts = FALSE)
mydata&amp;lt;-read.csv(&amp;quot;heart.csv&amp;quot;,header = TRUE)
names(mydata)[1]&amp;lt;-&amp;quot;age&amp;quot;
glimpse(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 303
## Columns: 14
## $ age      &amp;lt;int&amp;gt; 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      &amp;lt;int&amp;gt; 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp       &amp;lt;int&amp;gt; 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps &amp;lt;int&amp;gt; 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     &amp;lt;int&amp;gt; 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      &amp;lt;int&amp;gt; 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ restecg  &amp;lt;int&amp;gt; 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach  &amp;lt;int&amp;gt; 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    &amp;lt;int&amp;gt; 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak  &amp;lt;dbl&amp;gt; 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    &amp;lt;int&amp;gt; 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca       &amp;lt;int&amp;gt; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal     &amp;lt;int&amp;gt; 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target   &amp;lt;int&amp;gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The data at hand has the following features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;age.&lt;/li&gt;
&lt;li&gt;sex: 1=male,0=female&lt;/li&gt;
&lt;li&gt;cp : chest pain type.&lt;/li&gt;
&lt;li&gt;trestbps : resting blood pressure.&lt;/li&gt;
&lt;li&gt;chol: serum cholestoral.&lt;/li&gt;
&lt;li&gt;fbs : fasting blood sugar.&lt;/li&gt;
&lt;li&gt;restecg : resting electrocardiographic results.&lt;/li&gt;
&lt;li&gt;thalach : maximum heart rate achieved&lt;/li&gt;
&lt;li&gt;exang : exercise induced angina.&lt;/li&gt;
&lt;li&gt;oldpeak : ST depression induced by exercise relative to rest.&lt;/li&gt;
&lt;li&gt;slope : the slope of the peak exercise ST segment.&lt;/li&gt;
&lt;li&gt;ca : number of major vessels colored by flourosopy.&lt;/li&gt;
&lt;li&gt;thal : it is not well defined from the data source.&lt;/li&gt;
&lt;li&gt;target: have heart disease or not.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We see that some features should be converted to factor type as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-mydata %&amp;gt;%
  modify_at(c(2,3,6,7,9,11,12,13,14),as.factor)
glimpse(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 303
## Columns: 14
## $ age      &amp;lt;int&amp;gt; 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      &amp;lt;fct&amp;gt; 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp       &amp;lt;fct&amp;gt; 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps &amp;lt;int&amp;gt; 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     &amp;lt;int&amp;gt; 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      &amp;lt;fct&amp;gt; 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ restecg  &amp;lt;fct&amp;gt; 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach  &amp;lt;int&amp;gt; 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    &amp;lt;fct&amp;gt; 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak  &amp;lt;dbl&amp;gt; 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    &amp;lt;fct&amp;gt; 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca       &amp;lt;fct&amp;gt; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal     &amp;lt;fct&amp;gt; 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target   &amp;lt;fct&amp;gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before going ahead, we should check the relationships between the target variable and the remaining factors:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+sex,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       sex
## target   0   1
##      0  24 114
##      1  72  93&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+cp,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       cp
## target   0   1   2   3
##      0 104   9  18   7
##      1  39  41  69  16&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+fbs,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       fbs
## target   0   1
##      0 116  22
##      1 142  23&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+restecg,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       restecg
## target  0  1  2
##      0 79 56  3
##      1 68 96  1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+exang,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       exang
## target   0   1
##      0  62  76
##      1 142  23&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+slope,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       slope
## target   0   1   2
##      0  12  91  35
##      1   9  49 107&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+ca,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       ca
## target   0   1   2   3   4
##      0  45  44  31  17   1
##      1 130  21   7   3   4&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+thal,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       thal
## target   0   1   2   3
##      0   1  12  36  89
##      1   1   6 130  28&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, the &lt;strong&gt;restecg&lt;/strong&gt;, &lt;strong&gt;ca&lt;/strong&gt; and &lt;strong&gt;thal&lt;/strong&gt; variables have cell counts below the threshold of 5 cases required for logistic regression. In addition, if we split the data into a training set and a test set, level &lt;strong&gt;2&lt;/strong&gt; of the &lt;strong&gt;restecg&lt;/strong&gt; variable could be absent from one of the sets, since it has so few cases. Therefore we should remove these variables from the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-mydata[,-c(7,12,13)]
glimpse(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 303
## Columns: 11
## $ age      &amp;lt;int&amp;gt; 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      &amp;lt;fct&amp;gt; 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp       &amp;lt;fct&amp;gt; 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps &amp;lt;int&amp;gt; 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     &amp;lt;int&amp;gt; 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      &amp;lt;fct&amp;gt; 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ thalach  &amp;lt;int&amp;gt; 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    &amp;lt;fct&amp;gt; 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak  &amp;lt;dbl&amp;gt; 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    &amp;lt;fct&amp;gt; 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ target   &amp;lt;fct&amp;gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before training our model, we can get a rough insight into which predictors are likely to be important for predicting the dependent variable.&lt;/p&gt;
&lt;p&gt;Let’s plot the relationships between the target variable and the other features.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(sex,target,color=target))+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/logimodel/logimodel_files/figure-html/unnamed-chunk-6-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If we look only at the red points (healthy patients), we could wrongly conclude that females are less healthy than males. This interpretation ignores the imbalance between the sex levels (96 females, 207 males). In contrast, if we look only at females, we can say that a particular female is more likely to have the disease than not.&lt;/p&gt;
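&lt;p&gt;To compare the two sexes on a fair footing, we can look at the proportions within each sex level rather than the raw counts; a quick sketch using base R on the same mydata object:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# row-wise proportions: share of each target class within each sex
prop.table(xtabs(~sex + target, data = mydata), margin = 1)&lt;/code&gt;&lt;/pre&gt;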
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(cp, fill = target))+
  geom_bar(stat = &amp;quot;count&amp;quot;, position = &amp;quot;dodge&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/logimodel/logimodel_files/figure-html/unnamed-chunk-7-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From this plot we can conclude that a patient without any chest pain is highly unlikely to have the disease, whereas for any chest pain type the patient is more likely to be affected by it. We can therefore expect this predictor to be of significant importance in the trained model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata, aes(age,fill=target))+
  geom_density(alpha=.5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/logimodel/logimodel_files/figure-html/unnamed-chunk-8-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-partition&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Data partition&lt;/h1&gt;
&lt;p&gt;We take out 80% of the data to use as the training set; the rest is put aside to evaluate the model’s performance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
index&amp;lt;-createDataPartition(mydata$target, p=.8,list=FALSE)
train&amp;lt;-mydata[index,]
test&amp;lt;-mydata[-index,]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;train-the-model&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Train the model&lt;/h1&gt;
&lt;p&gt;We are now ready to train our model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model &amp;lt;- glm(target~., data=train,family = &amp;quot;binomial&amp;quot;)
summary(model)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## glm(formula = target ~ ., family = &amp;quot;binomial&amp;quot;, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5855  -0.5294   0.1990   0.6120   2.4022  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(&amp;gt;|z|)    
## (Intercept)  3.715274   2.883238   1.289 0.197545    
## age         -0.014712   0.023285  -0.632 0.527502    
## sex1        -1.686359   0.479254  -3.519 0.000434 ***
## cp1          1.212919   0.549670   2.207 0.027340 *  
## cp2          2.010255   0.486638   4.131 3.61e-05 ***
## cp3          2.139066   0.682727   3.133 0.001730 ** 
## trestbps    -0.020471   0.012195  -1.679 0.093220 .  
## chol        -0.005840   0.003776  -1.547 0.121959    
## fbs1        -0.200690   0.519116  -0.387 0.699053    
## thalach      0.024461   0.010928   2.238 0.025196 *  
## exang1      -0.792717   0.431434  -1.837 0.066151 .  
## oldpeak     -0.820508   0.231100  -3.550 0.000385 ***
## slope1      -0.999768   1.015514  -0.984 0.324872    
## slope2      -0.767247   1.097448  -0.699 0.484477    
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 191.33  on 229  degrees of freedom
## AIC: 219.33
## 
## Number of Fisher Scoring iterations: 5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see from the p-values that some variables are not significant, such as &lt;strong&gt;age&lt;/strong&gt;, &lt;strong&gt;chol&lt;/strong&gt;, &lt;strong&gt;fbs&lt;/strong&gt; and &lt;strong&gt;slope&lt;/strong&gt;, as well as the intercept. First, let’s remove the insignificant factor variables &lt;strong&gt;fbs&lt;/strong&gt; and &lt;strong&gt;slope&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model &amp;lt;- glm(target~.-fbs-slope, data=train,family = &amp;quot;binomial&amp;quot;)
summary(model)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## glm(formula = target ~ . - fbs - slope, family = &amp;quot;binomial&amp;quot;, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6702  -0.5505   0.1993   0.6344   2.4495  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(&amp;gt;|z|)    
## (Intercept)  2.826395   2.695175   1.049 0.294322    
## age         -0.016677   0.023157  -0.720 0.471420    
## sex1        -1.729320   0.470656  -3.674 0.000239 ***
## cp1          1.243879   0.548288   2.269 0.023289 *  
## cp2          1.987151   0.472994   4.201 2.65e-05 ***
## cp3          2.125766   0.677257   3.139 0.001696 ** 
## trestbps    -0.020672   0.012005  -1.722 0.085084 .  
## chol        -0.006434   0.003721  -1.729 0.083816 .  
## thalach      0.026567   0.010432   2.547 0.010873 *  
## exang1      -0.848162   0.423189  -2.004 0.045047 *  
## oldpeak     -0.798699   0.198597  -4.022 5.78e-05 ***
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 192.66  on 232  degrees of freedom
## AIC: 214.66
## 
## Number of Fisher Scoring iterations: 5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we remove the &lt;strong&gt;age&lt;/strong&gt; variable since it is the least significant.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model &amp;lt;- glm(target~.-fbs-slope-age, data=train,family = &amp;quot;binomial&amp;quot;)
summary(model)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## glm(formula = target ~ . - fbs - slope - age, family = &amp;quot;binomial&amp;quot;, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6925  -0.5397   0.2032   0.6345   2.4032  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(&amp;gt;|z|)    
## (Intercept)  1.703126   2.188741   0.778 0.436492    
## sex1        -1.677986   0.463447  -3.621 0.000294 ***
## cp1          1.221925   0.545175   2.241 0.025004 *  
## cp2          1.961200   0.468443   4.187 2.83e-05 ***
## cp3          2.085409   0.676469   3.083 0.002051 ** 
## trestbps    -0.022133   0.011872  -1.864 0.062273 .  
## chol        -0.006900   0.003675  -1.878 0.060443 .  
## thalach      0.029761   0.009471   3.142 0.001676 ** 
## exang1      -0.820113   0.420434  -1.951 0.051101 .  
## oldpeak     -0.803423   0.198400  -4.050 5.13e-05 ***
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 193.19  on 233  degrees of freedom
## AIC: 213.19
## 
## Number of Fisher Scoring iterations: 5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we remove the &lt;strong&gt;exang&lt;/strong&gt; variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model &amp;lt;- glm(target~.-fbs-slope-age-exang, data=train,family = &amp;quot;binomial&amp;quot;)
summary(model)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## glm(formula = target ~ . - fbs - slope - age - exang, family = &amp;quot;binomial&amp;quot;, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7030  -0.5643   0.2004   0.6510   2.5728  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(&amp;gt;|z|)    
## (Intercept)  0.832691   2.105139   0.396 0.692436    
## sex1        -1.713577   0.459659  -3.728 0.000193 ***
## cp1          1.494091   0.528172   2.829 0.004672 ** 
## cp2          2.205121   0.454341   4.853 1.21e-06 ***
## cp3          2.220423   0.668760   3.320 0.000899 ***
## trestbps    -0.021812   0.011704  -1.864 0.062375 .  
## chol        -0.007110   0.003597  -1.977 0.048054 *  
## thalach      0.033412   0.009291   3.596 0.000323 ***
## oldpeak     -0.822277   0.195993  -4.195 2.72e-05 ***
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 196.98  on 234  degrees of freedom
## AIC: 214.98
## 
## Number of Fisher Scoring iterations: 5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that we cannot remove the intercept, even though it is not significant, because it absorbs the reference level “0” of the factor &lt;strong&gt;cp&lt;/strong&gt;, which is significant. This is hence our final model.&lt;/p&gt;
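&lt;p&gt;As a side note, if we preferred a different reference level for &lt;strong&gt;cp&lt;/strong&gt; (purely illustrative, not required for our final model), we could relevel the factor before refitting:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# illustrative only: make level 2 the reference level of cp, then refit
train$cp &amp;lt;- relevel(train$cp, ref = &amp;quot;2&amp;quot;)
model2 &amp;lt;- glm(target ~ . - fbs - slope - age - exang, data = train, family = &amp;quot;binomial&amp;quot;)&lt;/code&gt;&lt;/pre&gt;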
&lt;/div&gt;
&lt;div id=&#34;prediction-and-confusion-matrix&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Prediction and confusion matrix&lt;/h1&gt;
&lt;p&gt;We will use this model to make predictions on the training set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(model,train, type=&amp;quot;response&amp;quot;)
head(pred)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         2         3         4         6         7         8 
## 0.5202639 0.9331630 0.8330192 0.3354247 0.7730621 0.8705651&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using the confusion matrix, we get the accuracy rate on the training set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- as.integer(pred&amp;gt;0.5)
confusionMatrix(as.factor(pred),train$target, positive = &amp;quot;1&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  87  17
##          1  24 115
##                                           
##                Accuracy : 0.8313          
##                  95% CI : (0.7781, 0.8761)
##     No Information Rate : 0.5432          
##     P-Value [Acc &amp;gt; NIR] : &amp;lt;2e-16          
##                                           
##                   Kappa : 0.6583          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.3487          
##                                           
##             Sensitivity : 0.8712          
##             Specificity : 0.7838          
##          Pos Pred Value : 0.8273          
##          Neg Pred Value : 0.8365          
##              Prevalence : 0.5432          
##          Detection Rate : 0.4733          
##    Detection Prevalence : 0.5720          
##       Balanced Accuracy : 0.8275          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 1               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On the training set the accuracy rate is about 83.13%, but we are more interested in the accuracy on the test set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(model,test, type=&amp;quot;response&amp;quot;)
pred &amp;lt;- as.integer(pred&amp;gt;0.5)
confusionMatrix(as.factor(pred),test$target)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 16  3
##          1 11 30
##                                           
##                Accuracy : 0.7667          
##                  95% CI : (0.6396, 0.8662)
##     No Information Rate : 0.55            
##     P-Value [Acc &amp;gt; NIR] : 0.0004231       
##                                           
##                   Kappa : 0.5156          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.0613688       
##                                           
##             Sensitivity : 0.5926          
##             Specificity : 0.9091          
##          Pos Pred Value : 0.8421          
##          Neg Pred Value : 0.7317          
##              Prevalence : 0.4500          
##          Detection Rate : 0.2667          
##    Detection Prevalence : 0.3167          
##       Balanced Accuracy : 0.7508          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On the test set we get a lower accuracy rate, about 76.67%.&lt;/p&gt;
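&lt;p&gt;The 0.5 cutoff is only one possible choice; a threshold-free summary such as the ROC curve and its AUC can complement the accuracy rate. A sketch, assuming the &lt;strong&gt;pROC&lt;/strong&gt; package is installed:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(pROC)
prob &amp;lt;- predict(model, test, type=&amp;quot;response&amp;quot;)  # predicted probabilities
roc_obj &amp;lt;- roc(test$target, prob)                    # ROC curve on the test set
auc(roc_obj)                                            # area under the curve
plot(roc_obj)&lt;/code&gt;&lt;/pre&gt;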
&lt;/div&gt;
&lt;div id=&#34;the-link-function&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; The link function&lt;/h1&gt;
&lt;p&gt;By default the link function is the &lt;strong&gt;logit&lt;/strong&gt;, based on the logistic (sigmoid) function; we can, however, use the &lt;strong&gt;probit&lt;/strong&gt; link instead, which is based on the standard normal distribution.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model1 &amp;lt;- glm(target~.-fbs-slope-exang-age, data=train,
             family = binomial(link = &amp;quot;probit&amp;quot;))
summary(model1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## glm(formula = target ~ . - fbs - slope - exang - age, family = binomial(link = &amp;quot;probit&amp;quot;), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7779  -0.5883   0.1666   0.6670   2.5989  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(&amp;gt;|z|)    
## (Intercept)  0.373007   1.199910   0.311 0.755905    
## sex1        -0.940784   0.252631  -3.724 0.000196 ***
## cp1          0.830588   0.299919   2.769 0.005616 ** 
## cp2          1.275100   0.253681   5.026 5.00e-07 ***
## cp3          1.262407   0.387479   3.258 0.001122 ** 
## trestbps    -0.011677   0.006660  -1.753 0.079549 .  
## chol        -0.004068   0.002047  -1.987 0.046870 *  
## thalach      0.018999   0.005163   3.680 0.000233 ***
## oldpeak     -0.470191   0.108935  -4.316 1.59e-05 ***
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 197.23  on 234  degrees of freedom
## AIC: 215.23
## 
## Number of Fisher Scoring iterations: 6&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(model1,test, type=&amp;quot;response&amp;quot;)
pred &amp;lt;- as.integer(pred&amp;gt;0.5)
confusionMatrix(as.factor(pred),test$target)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 16  3
##          1 11 30
##                                           
##                Accuracy : 0.7667          
##                  95% CI : (0.6396, 0.8662)
##     No Information Rate : 0.55            
##     P-Value [Acc &amp;gt; NIR] : 0.0004231       
##                                           
##                   Kappa : 0.5156          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.0613688       
##                                           
##             Sensitivity : 0.5926          
##             Specificity : 0.9091          
##          Pos Pred Value : 0.8421          
##          Neg Pred Value : 0.7317          
##              Prevalence : 0.4500          
##          Detection Rate : 0.2667          
##    Detection Prevalence : 0.3167          
##       Balanced Accuracy : 0.7508          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, we get the same classification results, with only a slight difference in the &lt;strong&gt;AIC&lt;/strong&gt; criterion: &lt;strong&gt;215.23&lt;/strong&gt; for the &lt;strong&gt;probit&lt;/strong&gt; link versus &lt;strong&gt;214.98&lt;/strong&gt; for the &lt;strong&gt;logit&lt;/strong&gt; link.&lt;/p&gt;
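&lt;p&gt;The two fits can also be compared in a single call (a sketch, reusing the model objects above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;AIC(model, model1)  # logit fit vs probit fit, side by side&lt;/code&gt;&lt;/pre&gt;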
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>naive bayes</title>
      <link>https://modelingwithr.rbind.io/post/naivemodel/</link>
      <pubDate>Thu, 19 Dec 2019 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/post/naivemodel/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-partition&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Data partition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-training&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Model training&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-evaluation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Model evaluation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-fine-tuning&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Model fine-tuning:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;The &lt;strong&gt;naive Bayes&lt;/strong&gt; model is based on the strong assumption that the features are &lt;strong&gt;conditionally independent&lt;/strong&gt; given the class label. Since this assumption rarely holds in practice, the model is termed &lt;strong&gt;naive&lt;/strong&gt;. However, even when this assumption is not satisfied, the model still works very well (Kevin P. Murphy, 2012). Using this assumption, we can define the class-conditional density as the product of one-dimensional densities.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(X|y=c,\theta)=\prod_{j=1}^Dp(x_j|y=c,\theta_{jc})\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The possible one dimensional density for each feature depends on the type of the feature:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For real-valued features we can use the Gaussian distribution:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(X|y=c,\theta)=\prod_{j=1}^D\mathcal N(x_j|\mu_{jc},\sigma_{jc}^2)\]&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For binary features we can use the Bernoulli distribution:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(X|y=c,\theta)=\prod_{j=1}^DBer(x_j|\mu_{jc})\]&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For categorical features we can use the multinoulli (categorical) distribution:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(X|y=c,\theta)=\prod_{j=1}^DCat(x_j|\mu_{jc})\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;For data whose features are of mixed types, we can use a product of the above distributions, and this is what we will do in this post.&lt;/p&gt;
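&lt;p&gt;Once these one-dimensional densities are estimated, classification follows from Bayes’ rule: the predicted class maximizes the class prior times the product of the per-feature densities,&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\hat{y}=\underset{c}{\arg\max}\; p(y=c)\prod_{j=1}^Dp(x_j|y=c,\theta_{jc})\]&lt;/span&gt;&lt;/p&gt;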
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;The data that we will use here is &lt;a href=&#34;https://www.kaggle.com/johnsmith88/heart-disease-dataset&#34;&gt;downloaded from the Kaggle website&lt;/a&gt; and concerns heart disease.
Let us start by loading the needed packages and the data; then we give an appropriate name to the first column.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(caret)
mydata&amp;lt;-read.csv(&amp;quot;heart.csv&amp;quot;,header = TRUE)
names(mydata)[1]&amp;lt;-&amp;quot;age&amp;quot;
glimpse(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 303
## Columns: 14
## $ age      &amp;lt;int&amp;gt; 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      &amp;lt;int&amp;gt; 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp       &amp;lt;int&amp;gt; 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps &amp;lt;int&amp;gt; 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     &amp;lt;int&amp;gt; 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      &amp;lt;int&amp;gt; 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ restecg  &amp;lt;int&amp;gt; 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach  &amp;lt;int&amp;gt; 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    &amp;lt;int&amp;gt; 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak  &amp;lt;dbl&amp;gt; 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    &amp;lt;int&amp;gt; 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca       &amp;lt;int&amp;gt; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal     &amp;lt;int&amp;gt; 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target   &amp;lt;int&amp;gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;strong&gt;target&lt;/strong&gt; variable indicates whether a patient has the disease or not, based on the following features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;age.&lt;/li&gt;
&lt;li&gt;sex: 1=male,0=female&lt;/li&gt;
&lt;li&gt;cp : chest pain type.&lt;/li&gt;
&lt;li&gt;trestbps : resting blood pressure.&lt;/li&gt;
&lt;li&gt;chol: serum cholestoral.&lt;/li&gt;
&lt;li&gt;fbs : fasting blood sugar.&lt;/li&gt;
&lt;li&gt;restecg : resting electrocardiographic results.&lt;/li&gt;
&lt;li&gt;thalach : maximum heart rate achieved&lt;/li&gt;
&lt;li&gt;exang : exercise induced angina.&lt;/li&gt;
&lt;li&gt;oldpeak : ST depression induced by exercise relative to rest.&lt;/li&gt;
&lt;li&gt;slope : the slope of the peak exercise ST segment.&lt;/li&gt;
&lt;li&gt;ca : number of major vessels colored by flourosopy.&lt;/li&gt;
&lt;li&gt;thal : it is not well defined from the data source.&lt;/li&gt;
&lt;li&gt;target: have heart disease or not.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The most intuitive way to start our analysis is to get a summary of the data, checking the range, the quartiles, and the presence or absence of missing values for each feature.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       age             sex               cp           trestbps    
##  Min.   :29.00   Min.   :0.0000   Min.   :0.000   Min.   : 94.0  
##  1st Qu.:47.50   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:120.0  
##  Median :55.00   Median :1.0000   Median :1.000   Median :130.0  
##  Mean   :54.37   Mean   :0.6832   Mean   :0.967   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.000   Max.   :200.0  
##       chol            fbs            restecg          thalach     
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.5  
##  Median :240.0   Median :0.0000   Median :1.0000   Median :153.0  
##  Mean   :246.3   Mean   :0.1485   Mean   :0.5281   Mean   :149.6  
##  3rd Qu.:274.5   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##      exang           oldpeak         slope             ca        
##  Min.   :0.0000   Min.   :0.00   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.80   Median :1.000   Median :0.0000  
##  Mean   :0.3267   Mean   :1.04   Mean   :1.399   Mean   :0.7294  
##  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.20   Max.   :2.000   Max.   :4.0000  
##       thal           target      
##  Min.   :0.000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:0.0000  
##  Median :2.000   Median :1.0000  
##  Mean   :2.314   Mean   :0.5446  
##  3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :1.0000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After inspecting the features, we see that some variables should be treated as factors rather than numerics, such as &lt;strong&gt;sex&lt;/strong&gt;, &lt;strong&gt;cp&lt;/strong&gt;, &lt;strong&gt;fbs&lt;/strong&gt;, &lt;strong&gt;restecg&lt;/strong&gt;, &lt;strong&gt;exang&lt;/strong&gt;, &lt;strong&gt;slope&lt;/strong&gt;, &lt;strong&gt;ca&lt;/strong&gt;, &lt;strong&gt;thal&lt;/strong&gt;, and the &lt;strong&gt;target&lt;/strong&gt; variable; hence they will be converted to factor type as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-mydata %&amp;gt;%
  mutate_at(c(2,3,6,7,9,11,12,13,14),funs(as.factor))
summary(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       age        sex     cp         trestbps          chol       fbs    
##  Min.   :29.00   0: 96   0:143   Min.   : 94.0   Min.   :126.0   0:258  
##  1st Qu.:47.50   1:207   1: 50   1st Qu.:120.0   1st Qu.:211.0   1: 45  
##  Median :55.00           2: 87   Median :130.0   Median :240.0          
##  Mean   :54.37           3: 23   Mean   :131.6   Mean   :246.3          
##  3rd Qu.:61.00                   3rd Qu.:140.0   3rd Qu.:274.5          
##  Max.   :77.00                   Max.   :200.0   Max.   :564.0          
##  restecg    thalach      exang      oldpeak     slope   ca      thal    target 
##  0:147   Min.   : 71.0   0:204   Min.   :0.00   0: 21   0:175   0:  2   0:138  
##  1:152   1st Qu.:133.5   1: 99   1st Qu.:0.00   1:140   1: 65   1: 18   1:165  
##  2:  4   Median :153.0           Median :0.80   2:142   2: 38   2:166          
##          Mean   :149.6           Mean   :1.04           3: 20   3:117          
##          3rd Qu.:166.0           3rd Qu.:1.60           4:  5                  
##          Max.   :202.0           Max.   :6.20&lt;/code&gt;&lt;/pre&gt;
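&lt;p&gt;(Aside: &lt;code&gt;funs()&lt;/code&gt; has since been deprecated in newer releases of &lt;strong&gt;dplyr&lt;/strong&gt;; an equivalent sketch using &lt;code&gt;across()&lt;/code&gt; would be:)&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# the same conversion written with the newer across() helper
mydata &amp;lt;- mydata %&amp;gt;%
  mutate(across(c(2,3,6,7,9,11,12,13,14), as.factor))&lt;/code&gt;&lt;/pre&gt;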
&lt;p&gt;In practice it is very useful to inspect (via traditional statistical tests such as the &lt;strong&gt;chi-squared&lt;/strong&gt; test, or via correlation coefficients) the relationships between the target variable and each potential explanatory variable before building any model. Doing so lets us tell the relevant variables apart from the irrelevant ones, and hence decide which to include in our model.
Another important issue with factors is that, when splitting the data into a training set and a testing set, a factor level can be missing from one set if the number of cases for that level is too small.&lt;br /&gt;
Let’s check whether all the factor levels contribute to each target variable level.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+sex,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       sex
## target   0   1
##      0  24 114
##      1  72  93&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+cp,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       cp
## target   0   1   2   3
##      0 104   9  18   7
##      1  39  41  69  16&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+fbs,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       fbs
## target   0   1
##      0 116  22
##      1 142  23&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+restecg,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       restecg
## target  0  1  2
##      0 79 56  3
##      1 68 96  1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+exang,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       exang
## target   0   1
##      0  62  76
##      1 142  23&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+slope,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       slope
## target   0   1   2
##      0  12  91  35
##      1   9  49 107&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+ca,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       ca
## target   0   1   2   3   4
##      0  45  44  31  17   1
##      1 130  21   7   3   4&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+thal,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       thal
## target   0   1   2   3
##      0   1  12  36  89
##      1   1   6 130  28&lt;/code&gt;&lt;/pre&gt;
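&lt;p&gt;Each of these contingency tables can also be tested formally with the chi-squared test mentioned earlier; a sketch for one of them:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# test the association between sex and target;
# on sparse tables such as restecg, R warns that the
# chi-squared approximation may be inaccurate
chisq.test(xtabs(~target+sex, data=mydata))&lt;/code&gt;&lt;/pre&gt;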
&lt;p&gt;As we see, the &lt;strong&gt;restecg&lt;/strong&gt;, &lt;strong&gt;ca&lt;/strong&gt; and &lt;strong&gt;thal&lt;/strong&gt; variables have levels with fewer than the threshold of 5 cases required. For instance, if we split the data into a training set and a test set, level &lt;strong&gt;2&lt;/strong&gt; of the &lt;strong&gt;restecg&lt;/strong&gt; variable risks being absent from one of the sets, since it has so few cases. Therefore we should remove these variables from the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-mydata[,-c(7,12,13)]
glimpse(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 303
## Columns: 11
## $ age      &amp;lt;int&amp;gt; 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      &amp;lt;fct&amp;gt; 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp       &amp;lt;fct&amp;gt; 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps &amp;lt;int&amp;gt; 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     &amp;lt;int&amp;gt; 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      &amp;lt;fct&amp;gt; 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ thalach  &amp;lt;int&amp;gt; 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    &amp;lt;fct&amp;gt; 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak  &amp;lt;dbl&amp;gt; 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    &amp;lt;fct&amp;gt; 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ target   &amp;lt;fct&amp;gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before training our model, we can get a rough insight into which predictors carry some importance for predicting the dependent variable.&lt;/p&gt;
&lt;p&gt;Let’s plot the relationships between the target variable and the other features.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(sex,target,color=target))+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-7-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If we look only at the red points (healthy patients), we could wrongly conclude that females are less healthy than males. This is because we are not taking into account the imbalanced number of cases at each sex level (96 females, 207 males). In contrast, if we look only at females, we can say that a particular female is more likely to have the disease than not.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(cp,fill= target))+
  geom_bar(stat = &amp;quot;count&amp;quot;,position = &amp;quot;dodge&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-8-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From this plot we can conclude that a patient who does not have any chest pain is highly unlikely to have the disease, whereas for any chest pain type the patient is more likely to be affected by it. We can therefore expect this predictor to have a significant importance in the trained model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata, aes(age,fill=target))+
  geom_density(alpha=.5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-9-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Even though there is a large amount of overlap between the two densities, some difference still exists; note also that these densities are estimated from the sample, not from the true distributions. However, we do not worry much about this, since we will evaluate the resulting model on the testing set.&lt;br /&gt;
We can also check the independence assumption with the correlation matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(psych)
pairs.panels(mydata[,-11])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-10-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As we see, all the correlations are less than 50%, so we can go ahead and train our model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-partition&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Data partition&lt;/h1&gt;
&lt;p&gt;We take out 80% of the data to use as the training set, and the rest will be put aside to evaluate the model performance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
index&amp;lt;-createDataPartition(mydata$target, p=.8,list=FALSE)
train&amp;lt;-mydata[index,]
test&amp;lt;-mydata[-index,]&lt;/code&gt;&lt;/pre&gt;
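&lt;p&gt;Since &lt;code&gt;createDataPartition&lt;/code&gt; samples within each class, we can verify that the class proportions are preserved in both sets (a sketch):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# the class proportions should be close in the two sets
prop.table(table(train$target))
prop.table(table(test$target))&lt;/code&gt;&lt;/pre&gt;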
&lt;/div&gt;
&lt;div id=&#34;model-training&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Model training&lt;/h1&gt;
&lt;p&gt;Note: for this model we do not need to set a seed, because the model uses known densities for the predictors and does not involve any random method.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(naivebayes)
modelnv&amp;lt;-naive_bayes(target~.,data=train)
modelnv&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## ================================== Naive Bayes ================================== 
##  
##  Call: 
## naive_bayes.formula(formula = target ~ ., data = train)
## 
## --------------------------------------------------------------------------------- 
##  
## Laplace smoothing: 0
## 
## --------------------------------------------------------------------------------- 
##  
##  A priori probabilities: 
## 
##         0         1 
## 0.4567901 0.5432099 
## 
## --------------------------------------------------------------------------------- 
##  
##  Tables: 
## 
## --------------------------------------------------------------------------------- 
##  ::: age (Gaussian) 
## --------------------------------------------------------------------------------- 
##       
## age            0         1
##   mean 56.432432 52.378788
##   sd    8.410623  9.896819
## 
## --------------------------------------------------------------------------------- 
##  ::: sex (Bernoulli) 
## --------------------------------------------------------------------------------- 
##    
## sex         0         1
##   0 0.1891892 0.3939394
##   1 0.8108108 0.6060606
## 
## --------------------------------------------------------------------------------- 
##  ::: cp (Categorical) 
## --------------------------------------------------------------------------------- 
##    
## cp           0          1
##   0 0.75675676 0.22727273
##   1 0.07207207 0.25000000
##   2 0.12612613 0.42424242
##   3 0.04504505 0.09848485
## 
## --------------------------------------------------------------------------------- 
##  ::: trestbps (Gaussian) 
## --------------------------------------------------------------------------------- 
##         
## trestbps         0         1
##     mean 133.82883 128.75758
##     sd    18.26267  15.21857
## 
## --------------------------------------------------------------------------------- 
##  ::: chol (Gaussian) 
## --------------------------------------------------------------------------------- 
##       
## chol           0         1
##   mean 248.52252 240.80303
##   sd    51.07194  53.55705
## 
## ---------------------------------------------------------------------------------
## 
## # ... and 5 more tables
## 
## ---------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, each predictor is treated according to its type: a Gaussian distribution for numeric variables, a Bernoulli distribution for binary variables, and a multinoulli distribution for categorical variables.&lt;/p&gt;
&lt;p&gt;All the information about this model can be extracted using the function &lt;strong&gt;attributes&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;attributes(modelnv)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## $names
## [1] &amp;quot;data&amp;quot;       &amp;quot;levels&amp;quot;     &amp;quot;laplace&amp;quot;    &amp;quot;tables&amp;quot;     &amp;quot;prior&amp;quot;     
## [6] &amp;quot;usekernel&amp;quot;  &amp;quot;usepoisson&amp;quot; &amp;quot;call&amp;quot;      
## 
## $class
## [1] &amp;quot;naive_bayes&amp;quot;&lt;/code&gt;&lt;/pre&gt;
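&lt;p&gt;Individual conditional distributions can be pulled out of the &lt;code&gt;tables&lt;/code&gt; element; for instance (a sketch):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modelnv$tables$age  # Gaussian mean and sd of age within each class&lt;/code&gt;&lt;/pre&gt;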
&lt;p&gt;We can visualize the above results with the function &lt;strong&gt;plot&lt;/strong&gt;, which displays the distribution of each feature: densities for numeric features and bars for factors.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(modelnv)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-1.svg&#34; width=&#34;576&#34; /&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-2.svg&#34; width=&#34;576&#34; /&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-3.svg&#34; width=&#34;576&#34; /&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-4.svg&#34; width=&#34;576&#34; /&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-5.svg&#34; width=&#34;576&#34; /&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-6.svg&#34; width=&#34;576&#34; /&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-7.svg&#34; width=&#34;576&#34; /&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-8.svg&#34; width=&#34;576&#34; /&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-9.svg&#34; width=&#34;576&#34; /&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-10.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;model-evaluation&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Model evaluation&lt;/h1&gt;
&lt;p&gt;We can check the accuracy of this model on the training data using the confusion matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-predict(modelnv,train)
confusionMatrix(pred,train$target)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  86  24
##          1  25 108
##                                           
##                Accuracy : 0.7984          
##                  95% CI : (0.7423, 0.8469)
##     No Information Rate : 0.5432          
##     P-Value [Acc &amp;gt; NIR] : &amp;lt;2e-16          
##                                           
##                   Kappa : 0.5934          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 1               
##                                           
##             Sensitivity : 0.7748          
##             Specificity : 0.8182          
##          Pos Pred Value : 0.7818          
##          Neg Pred Value : 0.8120          
##              Prevalence : 0.4568          
##          Detection Rate : 0.3539          
##    Detection Prevalence : 0.4527          
##       Balanced Accuracy : 0.7965          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The accuracy rate on the training set is about 79.84%.
As expected, the specificity rate (81.82%) for class 1 is larger than the sensitivity rate (77.48%) for class 0, which reflects the fact that the data contain more class 1 than class 0 observations.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;print(prop.table(table(train$target)),digits = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##    0    1 
## 0.46 0.54&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The reliable evaluation is that based on the unseen testing data rather than the training data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-predict(modelnv,test)
confusionMatrix(pred,test$target)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 18  6
##          1  9 27
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.6214, 0.8528)
##     No Information Rate : 0.55            
##     P-Value [Acc &amp;gt; NIR] : 0.001116        
##                                           
##                   Kappa : 0.4898          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.605577        
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.8182          
##          Pos Pred Value : 0.7500          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3000          
##    Detection Prevalence : 0.4000          
##       Balanced Accuracy : 0.7424          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The accuracy rate on the test set is now about 75%, which may be due to an overfitting problem, or to this kind of model not being well suited to this data.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;model-fine-tuning&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Model fine-tuning&lt;/h1&gt;
&lt;p&gt;In order to increase the model performance we can try another set of hyperparameters. The naive Bayes model has different kernels; by default the usekernel argument is set to &lt;strong&gt;FALSE&lt;/strong&gt;, which uses the Gaussian distribution for the numeric variables, while if it is &lt;strong&gt;TRUE&lt;/strong&gt; kernel density estimation is applied instead. Let’s set it to &lt;strong&gt;TRUE&lt;/strong&gt; and see what happens to the test accuracy rate.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modelnv1&amp;lt;-naive_bayes(target~.,data=train,
                      usekernel = TRUE)
pred&amp;lt;-predict(modelnv1,test)
confusionMatrix(pred,test$target)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 19  6
##          1  8 27
##                                           
##                Accuracy : 0.7667          
##                  95% CI : (0.6396, 0.8662)
##     No Information Rate : 0.55            
##     P-Value [Acc &amp;gt; NIR] : 0.0004231       
##                                           
##                   Kappa : 0.5254          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.7892680       
##                                           
##             Sensitivity : 0.7037          
##             Specificity : 0.8182          
##          Pos Pred Value : 0.7600          
##          Neg Pred Value : 0.7714          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3167          
##    Detection Prevalence : 0.4167          
##       Balanced Accuracy : 0.7609          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After using the kernel estimation we obtain a slight improvement in the accuracy rate, which is now about 76.67%.&lt;/p&gt;
&lt;p&gt;Another way to improve the model is to preprocess the data, especially the numeric variables, by centering and scaling them.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modelnv2&amp;lt;-train(target~., data=train,
                method=&amp;quot;naive_bayes&amp;quot;,
                preProc=c(&amp;quot;center&amp;quot;,&amp;quot;scale&amp;quot;))
modelnv2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Naive Bayes 
## 
## 243 samples
##  10 predictor
##   2 classes: &amp;#39;0&amp;#39;, &amp;#39;1&amp;#39; 
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 243, 243, 243, 243, 243, 243, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.7775205  0.5511328
##    TRUE      0.7490468  0.4988034
## 
## Tuning parameter &amp;#39;laplace&amp;#39; was held constant at a value of 0
## Tuning
##  parameter &amp;#39;adjust&amp;#39; was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = FALSE
##  and adjust = 1.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, we get a better accuracy rate with the Gaussian distribution, 77.75% (when usekernel = FALSE), than with the kernel estimation, 74.90%.&lt;/p&gt;
&lt;p&gt;Let’s use the test set:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-predict(modelnv2,test)
confusionMatrix(pred,test$target)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 19  5
##          1  8 28
##                                          
##                Accuracy : 0.7833         
##                  95% CI : (0.658, 0.8793)
##     No Information Rate : 0.55           
##     P-Value [Acc &amp;gt; NIR] : 0.0001472      
##                                          
##                   Kappa : 0.5578         
##                                          
##  Mcnemar&amp;#39;s Test P-Value : 0.5790997      
##                                          
##             Sensitivity : 0.7037         
##             Specificity : 0.8485         
##          Pos Pred Value : 0.7917         
##          Neg Pred Value : 0.7778         
##              Prevalence : 0.4500         
##          Detection Rate : 0.3167         
##    Detection Prevalence : 0.4000         
##       Balanced Accuracy : 0.7761         
##                                          
##        &amp;#39;Positive&amp;#39; Class : 0              
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We have another slight improvement, with an accuracy rate of &lt;strong&gt;78.33%&lt;/strong&gt;, after scaling the data.&lt;/p&gt;
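&lt;p&gt;As a further step, caret exposes two more tuning parameters for this method besides &lt;code&gt;usekernel&lt;/code&gt;, namely &lt;code&gt;laplace&lt;/code&gt; and &lt;code&gt;adjust&lt;/code&gt; (both held constant in the output above). A possible sketch of a fuller grid search, with illustrative grid values, could look like this:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# sketch: search over the three tuning parameters caret reports
# for method = &amp;quot;naive_bayes&amp;quot; (grid values are illustrative)
grid&amp;lt;-expand.grid(laplace = c(0, 0.5, 1),
                  usekernel = c(FALSE, TRUE),
                  adjust = c(0.5, 1, 2))
modelnv3&amp;lt;-train(target~., data=train,
                method=&amp;quot;naive_bayes&amp;quot;,
                preProc=c(&amp;quot;center&amp;quot;,&amp;quot;scale&amp;quot;),
                tuneGrid=grid)&lt;/code&gt;&lt;/pre&gt;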
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Conclusion&lt;/h1&gt;
&lt;p&gt;The naive Bayes model is among the most widely used classical machine learning models, especially with features that are normally distributed, either originally or after transformation. However, compared to bagged or boosted models such as random forest and xgboost, or to deep learning models, it is considerably less attractive.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>knn model</title>
      <link>https://modelingwithr.rbind.io/post/2015-07-23-r-rmarkdown/</link>
      <pubDate>Mon, 16 Dec 2019 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/post/2015-07-23-r-rmarkdown/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#classification&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Classification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-partition&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Data partition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#train-the-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Train the model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#prediction-and-confusion-matrix&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Prediction and confusion matrix&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#fine-tuning-the-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Fine tuning the model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#comparison-between-knn-and-svm-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Comparison between knn and svm model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#regression&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8&lt;/span&gt; Regression&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;In this paper we will explore the &lt;strong&gt;k nearest neighbors&lt;/strong&gt; model using two data sets: the first is the &lt;strong&gt;Titanic&lt;/strong&gt; data, to which we will fit this model for classification, and the second is the &lt;strong&gt;BostonHousing&lt;/strong&gt; data (from the &lt;strong&gt;mlbench&lt;/strong&gt; package), which will be used to fit a regression model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;classification&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Classification&lt;/h1&gt;
&lt;p&gt;We do not repeat the whole process of data preparation and missing-value imputation. You can click &lt;a href=&#34;https://github.com/Metalesaek/svm-model&#34;&gt;here&lt;/a&gt; to see all the details in my paper about the &lt;strong&gt;support vector machine&lt;/strong&gt; model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-partition&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Data partition&lt;/h1&gt;
&lt;p&gt;All the code for the first steps is grouped in one chunk. Notice that we use the same specified parameter values and seed numbers in order to compare the results of the two models, &lt;strong&gt;svm&lt;/strong&gt; and &lt;strong&gt;knn&lt;/strong&gt;, for &lt;strong&gt;classification&lt;/strong&gt; (using the Titanic data) and for regression (using the BostonHousing data).&lt;/p&gt;
&lt;p&gt;This plot shows how the knn model works. With k = 5 the model chooses the 5 closest points, inside the dashed circle, so the blue point is predicted to be red by majority vote (3 red and 2 black); but with k = 9 the blue point is predicted to be black (5 black and 4 red).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(plotrix)
plot(train$Age[10:40],pch=16,train$Fare[10:40],
     col=train$Survived,ylim = c(0,50))
points(x=32,y=20,col=&amp;quot;blue&amp;quot;,pch=8)
draw.circle(x=32,y=20,nv=1000,radius = 5.5,lty=2)
draw.circle(x=32,y=20,nv=1000,radius = 10)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/2015-07-23-r-rmarkdown_files/figure-html/unnamed-chunk-3-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
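&lt;p&gt;The majority vote described above can be sketched in a few lines of base R (a toy illustration on a generic numeric matrix of training points, not the implementation used by caret):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# toy knn vote: x is the new point, X a numeric matrix of training
# points (one row per observation), y their class labels, k the
# number of neighbors
knn_vote&amp;lt;-function(x, X, y, k){
  d&amp;lt;-sqrt(rowSums((X - matrix(x, nrow(X), ncol(X), byrow = TRUE))^2))
  nearest&amp;lt;-order(d)[1:k]
  names(which.max(table(y[nearest])))  # majority class among the k
}&lt;/code&gt;&lt;/pre&gt;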
&lt;p&gt;The last thing we should do before training the model is to convert the factors to numerics and standardize all the predictors in both sets (train and test); finally, we rename the target variable levels.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train1 &amp;lt;- train %&amp;gt;% mutate_at(c(2,3,8),funs(as.numeric))
test1 &amp;lt;- test %&amp;gt;% mutate_at(c(2,3,8),funs(as.numeric))

processed&amp;lt;-preProcess(train1[,-1],method = c(&amp;quot;center&amp;quot;,&amp;quot;scale&amp;quot;))
train1[,-1]&amp;lt;-predict(processed,train1[,-1])
test1[,-1]&amp;lt;-predict(processed,test1[,-1])

train1$Survived &amp;lt;- fct_recode(train1$Survived,died=&amp;quot;0&amp;quot;,surv=&amp;quot;1&amp;quot;)
test1$Survived &amp;lt;- fct_recode(test1$Survived,died=&amp;quot;0&amp;quot;,surv=&amp;quot;1&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;train-the-model&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Train the model&lt;/h1&gt;
&lt;p&gt;The big advantage of the &lt;strong&gt;k nearest neighbors&lt;/strong&gt; model is that it has a single hyperparameter, which makes the tuning process very fast. Here also we make use of the same seed as we did with the &lt;strong&gt;svm&lt;/strong&gt; model. For the resampling process we stick with the default bootstrap method with 25 resampling iterations.&lt;/p&gt;
&lt;p&gt;Let’s now launch the model and get the summary.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modelknn &amp;lt;- train(Survived~., data=train1,
                method=&amp;quot;knn&amp;quot;,
                tuneGrid=expand.grid(k=1:30))
modelknn&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## k-Nearest Neighbors 
## 
## 714 samples
##   7 predictor
##   2 classes: &amp;#39;died&amp;#39;, &amp;#39;surv&amp;#39; 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 714, 714, 714, 714, 714, 714, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.7717650  0.5165447
##    2  0.7688433  0.5088538
##    3  0.7820906  0.5370428
##    4  0.7881072  0.5487894
##    5  0.8003926  0.5733224
##    6  0.7992870  0.5711806
##    7  0.8046907  0.5827968
##    8  0.8104254  0.5950159
##    9  0.8093172  0.5927121
##   10  0.8098395  0.5937574
##   11  0.8110456  0.5957105
##   12  0.8103966  0.5942937
##   13  0.8100784  0.5939193
##   14  0.8115080  0.5960496
##   15  0.8146848  0.6026109
##   16  0.8125027  0.5979064
##   17  0.8147065  0.6015528
##   18  0.8142485  0.6002677
##   19  0.8146543  0.6003686
##   20  0.8124733  0.5960520
##   21  0.8100367  0.5906732
##   22  0.8102084  0.5893078
##   23  0.8094241  0.5873995
##   24  0.8103509  0.5891549
##   25  0.8106517  0.5895533
##   26  0.8116000  0.5909129
##   27  0.8090177  0.5853052
##   28  0.8102358  0.5882055
##   29  0.8114371  0.5905057
##   30  0.8127604  0.5937279
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 17.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The metric used to select the best parameter value is the &lt;strong&gt;accuracy&lt;/strong&gt; rate, whose best value, about 81.47%, is obtained at k = 17. We can also read these values off the plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(modelknn)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/2015-07-23-r-rmarkdown_files/figure-html/unnamed-chunk-6-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Regarding the contributions of the predictors, the importance measure (scaled from 0 to 100) shows that the most important one is by far &lt;strong&gt;Sex&lt;/strong&gt;, followed by &lt;strong&gt;Fare&lt;/strong&gt; and &lt;strong&gt;Pclass&lt;/strong&gt;, and the least important one is &lt;strong&gt;SibSp&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;varImp(modelknn)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ROC curve variable importance
## 
##          Importance
## Sex         100.000
## Fare         62.476
## Pclass       57.192
## Embarked     17.449
## Parch        17.045
## Age           4.409
## SibSp         0.000&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;prediction-and-confusion-matrix&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Prediction and confusion matrix&lt;/h1&gt;
&lt;p&gt;Let’s now use the test set to evaluate the model performance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-predict(modelknn,test1)
confusionMatrix(as.factor(pred),as.factor(test1$Survived))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction died surv
##       died   99   26
##       surv   10   42
##                                           
##                Accuracy : 0.7966          
##                  95% CI : (0.7297, 0.8533)
##     No Information Rate : 0.6158          
##     P-Value [Acc &amp;gt; NIR] : 1.87e-07        
##                                           
##                   Kappa : 0.5503          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.01242         
##                                           
##             Sensitivity : 0.9083          
##             Specificity : 0.6176          
##          Pos Pred Value : 0.7920          
##          Neg Pred Value : 0.8077          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5593          
##    Detection Prevalence : 0.7062          
##       Balanced Accuracy : 0.7630          
##                                           
##        &amp;#39;Positive&amp;#39; Class : died            
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see that the accuracy has slightly decreased, from 81.47% to 79.66%. The closeness of these rates is a good sign that we are not facing an &lt;strong&gt;overfitting&lt;/strong&gt; problem.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;fine-tuning-the-model&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Fine tuning the model&lt;/h1&gt;
&lt;p&gt;To seek improvements we can change the metric. The best summary function, which gives three important metrics for each resampling iteration (&lt;strong&gt;sensitivity&lt;/strong&gt;, &lt;strong&gt;specificity&lt;/strong&gt; and the area under the &lt;strong&gt;ROC&lt;/strong&gt; curve), is &lt;strong&gt;twoClassSummary&lt;/strong&gt;. We also keep the grid search over the number of neighbors up to 30.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;control &amp;lt;- trainControl(classProbs = TRUE,
                        summaryFunction = twoClassSummary)

set.seed(123)
modelknn1 &amp;lt;- train(Survived~., data=train1,
                method = &amp;quot;knn&amp;quot;,
                trControl = control,
                tuneGrid = expand.grid(k=1:30))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning in train.default(x, y, weights = w, ...): The metric &amp;quot;Accuracy&amp;quot; was not
## in the result set. ROC will be used instead.&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modelknn1&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## k-Nearest Neighbors 
## 
## 714 samples
##   7 predictor
##   2 classes: &amp;#39;died&amp;#39;, &amp;#39;surv&amp;#39; 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 714, 714, 714, 714, 714, 714, ... 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens       Spec     
##    1  0.7637394  0.8092152  0.7114938
##    2  0.7959615  0.8102352  0.7013654
##    3  0.8212495  0.8217986  0.7180595
##    4  0.8351414  0.8302266  0.7201146
##    5  0.8455418  0.8448702  0.7283368
##    6  0.8543141  0.8441066  0.7269378
##    7  0.8564044  0.8477382  0.7350766
##    8  0.8590356  0.8526960  0.7421475
##    9  0.8617600  0.8511745  0.7414201
##   10  0.8611361  0.8512356  0.7424516
##   11  0.8621287  0.8546357  0.7399914
##   12  0.8633050  0.8542288  0.7392237
##   13  0.8647328  0.8526082  0.7407331
##   14  0.8656300  0.8572596  0.7369673
##   15  0.8663956  0.8612937  0.7388392
##   16  0.8657711  0.8595923  0.7359633
##   17  0.8658168  0.8652505  0.7322408
##   18  0.8659659  0.8657088  0.7301132
##   19  0.8667079  0.8685106  0.7261585
##   20  0.8668361  0.8657052  0.7252522
##   21  0.8673051  0.8641660  0.7212182
##   22  0.8672610  0.8701453  0.7118060
##   23  0.8675945  0.8703195  0.7094977
##   24  0.8677684  0.8724153  0.7087639
##   25  0.8681884  0.8733028  0.7080003
##   26  0.8681201  0.8768128  0.7048740
##   27  0.8680570  0.8748635  0.7011357
##   28  0.8685130  0.8745234  0.7047600
##   29  0.8686459  0.8756557  0.7055821
##   30  0.8681316  0.8754088  0.7094507
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 29.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This time we use the &lt;strong&gt;ROC&lt;/strong&gt; to choose the best model, which gives a different value, k = 29, with an &lt;strong&gt;ROC&lt;/strong&gt; of 0.8686.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-predict(modelknn1,test1)
confusionMatrix(pred,test1$Survived)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction died surv
##       died   99   29
##       surv   10   39
##                                           
##                Accuracy : 0.7797          
##                  95% CI : (0.7113, 0.8384)
##     No Information Rate : 0.6158          
##     P-Value [Acc &amp;gt; NIR] : 2.439e-06       
##                                           
##                   Kappa : 0.5085          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.003948        
##                                           
##             Sensitivity : 0.9083          
##             Specificity : 0.5735          
##          Pos Pred Value : 0.7734          
##          Neg Pred Value : 0.7959          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5593          
##    Detection Prevalence : 0.7232          
##       Balanced Accuracy : 0.7409          
##                                           
##        &amp;#39;Positive&amp;#39; Class : died            
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using the &lt;strong&gt;ROC&lt;/strong&gt; metric we get a worse result for the accuracy rate, which has decreased from 79.66% to 77.97%.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;comparison-between-knn-and-svm-model&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Comparison between knn and svm model&lt;/h1&gt;
&lt;p&gt;Now let’s train an svm model with the same resampling method and compare the two.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;control&amp;lt;-trainControl(method=&amp;quot;boot&amp;quot;,number=25,
                      classProbs = TRUE,
                      summaryFunction = twoClassSummary)

modelsvm&amp;lt;-train(Survived~., data=train1,
                method=&amp;quot;svmRadial&amp;quot;,
                trControl=control)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning in train.default(x, y, weights = w, ...): The metric &amp;quot;Accuracy&amp;quot; was not
## in the result set. ROC will be used instead.&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modelsvm&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Support Vector Machines with Radial Basis Function Kernel 
## 
## 714 samples
##   7 predictor
##   2 classes: &amp;#39;died&amp;#39;, &amp;#39;surv&amp;#39; 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 714, 714, 714, 714, 714, 714, ... 
## Resampling results across tuning parameters:
## 
##   C     ROC        Sens       Spec     
##   0.25  0.8703474  0.8735475  0.7602162
##   0.50  0.8706929  0.8858278  0.7456306
##   1.00  0.8655619  0.8941179  0.7327856
## 
## Tuning parameter &amp;#39;sigma&amp;#39; was held constant at a value of 0.2282701
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.2282701 and C = 0.5.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And let’s get the confusion matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-predict(modelsvm,test1)
confusionMatrix(pred,test1$Survived)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction died surv
##       died  101   27
##       surv    8   41
##                                           
##                Accuracy : 0.8023          
##                  95% CI : (0.7359, 0.8582)
##     No Information Rate : 0.6158          
##     P-Value [Acc &amp;gt; NIR] : 7.432e-08       
##                                           
##                   Kappa : 0.5589          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.002346        
##                                           
##             Sensitivity : 0.9266          
##             Specificity : 0.6029          
##          Pos Pred Value : 0.7891          
##          Neg Pred Value : 0.8367          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5706          
##    Detection Prevalence : 0.7232          
##       Balanced Accuracy : 0.7648          
##                                           
##        &amp;#39;Positive&amp;#39; Class : died            
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see that the accuracy of this model, 80.23%, is higher than that of the knn model, 77.97% (the &lt;strong&gt;modelknn1&lt;/strong&gt;).
If we have a large number of models to compare, &lt;strong&gt;caret&lt;/strong&gt; provides a function called &lt;strong&gt;resamples&lt;/strong&gt; to compare models, but the models should have the same trainControl parameter values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;comp&amp;lt;-resamples(list( svm = modelsvm,
                         knn = modelknn1))

summary(comp)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## summary.resamples(object = comp)
## 
## Models: svm, knn 
## Number of resamples: 25 
## 
## ROC 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA&amp;#39;s
## svm 0.8472858 0.8617944 0.8691093 0.8706929 0.8744979 0.9043001    0
## knn 0.8298966 0.8577167 0.8670815 0.8686459 0.8792487 0.9135638    0
## 
## Sens 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA&amp;#39;s
## svm 0.8117647 0.8666667 0.8870056 0.8858278 0.9030303 0.9559748    0
## knn 0.8266667 0.8523490 0.8816568 0.8756557 0.8950617 0.9117647    0
## 
## Spec 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA&amp;#39;s
## svm 0.6774194 0.7096774 0.7428571 0.7456306 0.7714286 0.8425926    0
## knn 0.5865385 0.6741573 0.6989247 0.7055821 0.7252747 0.8191489    0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also plot the models’ metric values together.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dotplot(comp,metric=&amp;quot;ROC&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/2015-07-23-r-rmarkdown_files/figure-html/unnamed-chunk-14-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;regression&#34; class=&#34;section level1&#34; number=&#34;8&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;8&lt;/span&gt; Regression&lt;/h1&gt;
&lt;p&gt;First we call the &lt;strong&gt;BostonHousing&lt;/strong&gt; data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(mlbench)
data(&amp;quot;BostonHousing&amp;quot;)
glimpse(BostonHousing)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 506
## Columns: 14
## $ crim    &amp;lt;dbl&amp;gt; 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.088...
## $ zn      &amp;lt;dbl&amp;gt; 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5...
## $ indus   &amp;lt;dbl&amp;gt; 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87,...
## $ chas    &amp;lt;fct&amp;gt; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nox     &amp;lt;dbl&amp;gt; 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.5...
## $ rm      &amp;lt;dbl&amp;gt; 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.6...
## $ age     &amp;lt;dbl&amp;gt; 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9...
## $ dis     &amp;lt;dbl&amp;gt; 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9...
## $ rad     &amp;lt;dbl&amp;gt; 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4,...
## $ tax     &amp;lt;dbl&amp;gt; 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311,...
## $ ptratio &amp;lt;dbl&amp;gt; 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2,...
## $ b       &amp;lt;dbl&amp;gt; 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396...
## $ lstat   &amp;lt;dbl&amp;gt; 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17...
## $ medv    &amp;lt;dbl&amp;gt; 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9,...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will fit a knn model to this data using the continuous variable &lt;strong&gt;medv&lt;/strong&gt; as the target.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
index&amp;lt;-sample(nrow(BostonHousing),size = floor(0.8*(nrow(BostonHousing))))
train&amp;lt;-BostonHousing[index,]
test&amp;lt;-BostonHousing[-index,]

scaled&amp;lt;-preProcess(train[,-14],method=c(&amp;quot;center&amp;quot;,&amp;quot;scale&amp;quot;))
trainscaled&amp;lt;-predict(scaled,train)
testscaled&amp;lt;-predict(scaled,test)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are ready now to train our model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modelknnR &amp;lt;- train(medv~., data=trainscaled,
                method = &amp;quot;knn&amp;quot;,
                tuneGrid = expand.grid(k=1:60))
modelknnR&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## k-Nearest Neighbors 
## 
## 404 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 404, 404, 404, 404, 404, 404, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    1  4.711959  0.7479439  3.047925
##    2  4.600795  0.7545325  3.010235
##    3  4.554112  0.7583915  3.001404
##    4  4.416511  0.7733563  2.939100
##    5  4.414384  0.7736985  2.953741
##    6  4.405364  0.7758010  2.962082
##    7  4.375360  0.7799181  2.955250
##    8  4.409134  0.7773310  2.975489
##    9  4.427529  0.7770847  2.973016
##   10  4.414577  0.7804842  2.957983
##   11  4.447188  0.7787709  2.968389
##   12  4.475134  0.7767642  2.984709
##   13  4.489486  0.7760909  3.000489
##   14  4.518792  0.7746895  3.026858
##   15  4.554107  0.7717809  3.043645
##   16  4.583672  0.7694136  3.058097
##   17  4.599290  0.7695640  3.067001
##   18  4.632439  0.7671729  3.079895
##   19  4.670589  0.7643210  3.098643
##   20  4.708318  0.7614855  3.118593
##   21  4.736963  0.7596509  3.137784
##   22  4.756688  0.7590899  3.151654
##   23  4.781692  0.7577281  3.166203
##   24  4.813669  0.7554223  3.186575
##   25  4.843954  0.7533415  3.200120
##   26  4.872096  0.7513071  3.224031
##   27  4.896463  0.7502052  3.238489
##   28  4.920242  0.7497138  3.252959
##   29  4.944899  0.7484320  3.269227
##   30  4.966726  0.7479621  3.282756
##   31  4.996149  0.7460973  3.303607
##   32  5.024602  0.7438775  3.321013
##   33  5.055147  0.7420656  3.338457
##   34  5.083713  0.7403972  3.360867
##   35  5.108994  0.7388352  3.373694
##   36  5.132420  0.7372288  3.389177
##   37  5.156841  0.7354463  3.409025
##   38  5.175413  0.7349417  3.422294
##   39  5.196438  0.7340164  3.434986
##   40  5.225990  0.7314822  3.452499
##   41  5.249335  0.7299159  3.467267
##   42  5.275185  0.7281473  3.484101
##   43  5.300558  0.7263045  3.502388
##   44  5.322795  0.7251719  3.519220
##   45  5.349383  0.7232707  3.539266
##   46  5.376209  0.7210830  3.560509
##   47  5.398400  0.7199706  3.580476
##   48  5.424020  0.7180096  3.595497
##   49  5.445069  0.7166620  3.609308
##   50  5.469650  0.7145816  3.625718
##   51  5.492104  0.7127439  3.644329
##   52  5.515714  0.7107894  3.659286
##   53  5.535354  0.7092366  3.672172
##   54  5.562260  0.7063225  3.690854
##   55  5.581394  0.7049997  3.705917
##   56  5.600579  0.7036881  3.720464
##   57  5.623071  0.7018951  3.739874
##   58  5.645828  0.6999889  3.755824
##   59  5.662777  0.6990085  3.771570
##   60  5.682182  0.6976068  3.787733
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The best model is the one with k=7, for which the minimum RMSE is about 4.3754.&lt;/p&gt;
&lt;p&gt;We can also get the importance of the predictors.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(varImp(modelknnR))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/2015-07-23-r-rmarkdown_files/figure-html/unnamed-chunk-18-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Then we compute the predictions and the root mean squared error (&lt;strong&gt;RMSE&lt;/strong&gt;) as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-predict(modelknnR,testscaled)
head(pred)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 24.94286 29.88571 20.67143 20.31429 19.18571 20.28571&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;RMSE(pred,test$medv)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 4.416328&lt;/code&gt;&lt;/pre&gt;
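&lt;p&gt;caret’s &lt;strong&gt;RMSE&lt;/strong&gt; function simply computes the square root of the mean squared difference between predictions and observations; a quick sketch with toy vectors:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# RMSE by hand: sqrt of the mean squared prediction error
rmse &amp;lt;- function(pred, obs) sqrt(mean((pred - obs)^2))
rmse(c(24.9, 29.9, 20.7), c(24, 31, 20))  # about 0.915&lt;/code&gt;&lt;/pre&gt;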
&lt;p&gt;The RMSE on the test set is about &lt;strong&gt;4.4163&lt;/strong&gt;, slightly greater than the resampling estimate from the training set, &lt;strong&gt;4.3754&lt;/strong&gt;.
Finally, we can plot the predicted values against the observed values to get insight into their relationship.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data.frame(predicted=pred,observed=test$medv),aes(predicted,observed))+
  geom_point(col=&amp;quot;blue&amp;quot;)+
  geom_abline(col=&amp;quot;red&amp;quot;)+
  ggtitle(&amp;quot;actual values vs predicted values&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/2015-07-23-r-rmarkdown_files/figure-html/unnamed-chunk-20-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Methods for dealing with imbalanced data</title>
      <link>https://modelingwithr.rbind.io/post/methods-to-deal-with-imbalanced-data/</link>
      <pubDate>Wed, 10 Apr 2019 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/post/methods-to-deal-with-imbalanced-data/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-partition&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data partition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#subsampling-the-training-data&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Subsampling the training data&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#upsampling&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.1&lt;/span&gt; Upsampling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#downsampling&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.2&lt;/span&gt; Downsampling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#rose&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.3&lt;/span&gt; ROSE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#smote&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.4&lt;/span&gt; SMOTE&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#training-logistic-regression-model.&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Training the logistic regression model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#without-subsampling&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.1&lt;/span&gt; without subsampling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#upsampling-the-train-set&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.2&lt;/span&gt; Upsampling the train set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#down-sampling-the-training-set.&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.3&lt;/span&gt; Down sampling the training set.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#subsampline-the-train-set-by-rose-technique&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.4&lt;/span&gt; Subsampling the train set by ROSE technique&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#subsampling-the-train-set-by-smote-technique&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.5&lt;/span&gt; Subsampling the train set by SMOTE technique&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#deep-learning-model-without-class-weight.&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; deep learning model (without class weight).&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#deep-learning-model-with-class-weights&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5.1&lt;/span&gt; deep learning model with class weights&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;Imbalanced data is a common feature of certain types of data, such as credit card fraud data, where the number of fraudulent cards is usually very small compared to the number of non-fraudulent cards. The problem with imbalanced data is that models such as &lt;strong&gt;knn&lt;/strong&gt; and &lt;strong&gt;svm&lt;/strong&gt; become dominated by the majority class, so they predict the majority class more effectively than the minority class, which in turn results in a high sensitivity rate and a low specificity rate (in binary classification).&lt;/p&gt;
&lt;p&gt;A simple way to reduce the negative impact of this problem is to subsample the data. The most common subsampling methods used in practice are the following.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Upsampling&lt;/strong&gt;: this method increases the size of the minority class by sampling with replacement so that the classes will have the same size.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Downsampling&lt;/strong&gt;: in contrast to the above method, this one decreases the size of the majority class to be the same or closer to the minority class size by just taking out a random sample.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid methods&lt;/strong&gt;: the best-known hybrid methods are &lt;strong&gt;ROSE&lt;/strong&gt; (Random Over-Sampling Examples) and &lt;strong&gt;SMOTE&lt;/strong&gt; (Synthetic Minority Over-sampling Technique); they downsample the majority class and create new artificial points in the minority class. For more detail about the &lt;strong&gt;SMOTE&lt;/strong&gt; method click &lt;a href=&#34;https://journals.sagepub.com/doi/full/10.1177/0272989X14560647&#34;&gt;here&lt;/a&gt;, and for &lt;strong&gt;ROSE&lt;/strong&gt; click &lt;a href=&#34;https://www.rdocumentation.org/packages/ROSE/versions/0.0-3/topics/ROSE&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
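&lt;p&gt;The first two ideas can be sketched in base R (a toy illustration with made-up labels, not the &lt;strong&gt;caret&lt;/strong&gt; implementation used below):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)
y &amp;lt;- factor(c(rep(0, 95), rep(1, 5)))  # imbalanced labels: 95 vs 5
minority &amp;lt;- which(y == 1)
majority &amp;lt;- which(y == 0)
# Upsampling: resample the minority class with replacement up to the majority size
up &amp;lt;- c(majority, sample(minority, length(majority), replace = TRUE))
table(y[up])    # 0: 95, 1: 95
# Downsampling: keep a random subset of the majority class of minority size
down &amp;lt;- c(sample(majority, length(minority)), minority)
table(y[down])  # 0: 5, 1: 5&lt;/code&gt;&lt;/pre&gt;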
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: all the above methods should be applied only to the training set; the testing set must never be touched until the final model evaluation step.&lt;/p&gt;
&lt;p&gt;Some types of models can handle imbalanced data, such as a &lt;strong&gt;deep learning&lt;/strong&gt; model with the &lt;strong&gt;class_weight&lt;/strong&gt; argument, which gives more weight to the minority-class cases. For other models, however, such as &lt;strong&gt;svm&lt;/strong&gt; or &lt;strong&gt;knn&lt;/strong&gt;, we have to make use of one of the above methods before training.&lt;/p&gt;
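&lt;p&gt;The class-weight idea can be sketched with the &lt;strong&gt;weights&lt;/strong&gt; argument of &lt;strong&gt;glm&lt;/strong&gt; on simulated data (a toy illustration, not the deep learning model used later): each minority-class case receives a weight equal to the class ratio, so both classes contribute equally to the likelihood.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(42)
x &amp;lt;- rnorm(200)
y &amp;lt;- rbinom(200, 1, plogis(-2 + x))              # imbalanced: mostly zeros
w &amp;lt;- ifelse(y == 1, sum(y == 0)/sum(y == 1), 1)  # upweight the minority class
# quasibinomial avoids the non-integer-successes warning with prior weights
fit &amp;lt;- glm(y ~ x, family = quasibinomial, weights = w)
coef(fit)&lt;/code&gt;&lt;/pre&gt;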
&lt;p&gt;In this article we will make use of the &lt;strong&gt;creditcard&lt;/strong&gt; data from the kaggle website (click &lt;a href=&#34;https://www.kaggle.com/arvindratan/creditcard#creditcard.csv&#34;&gt;here&lt;/a&gt; to download it), which is highly imbalanced. We will train a &lt;strong&gt;logistic regression&lt;/strong&gt; model on the raw data and on the data transformed by each of the above methods, and compare the results. We will also use a simple deep learning model with and without taking the imbalance problem into account.&lt;/p&gt;
&lt;p&gt;First we call the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;spsm(library(tidyverse))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;ggplot2&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;tibble&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;tidyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;dplyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data &amp;lt;- read.csv(&amp;quot;../sparklyr/creditcard.csv&amp;quot;, header = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For privacy purposes, the original features have been replaced by the PCA variables v1 to v28; only the &lt;strong&gt;Time&lt;/strong&gt; and &lt;strong&gt;Amount&lt;/strong&gt; features are left from the original ones.&lt;/p&gt;
&lt;p&gt;Let’s first check the frequency of the &lt;strong&gt;Class&lt;/strong&gt; variable levels (after it has been converted to a factor type).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data$Class &amp;lt;- as.factor(data$Class)
prop.table(table(data$Class))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##           0           1 
## 0.998272514 0.001727486&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, the minority class label “1” makes up only about 0.17% of the total cases.
We also display a summary of the data to take an overall look at all the features and check for missing values or unusual outliers.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(data)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       Time              V1                  V2                  V3          
##  Min.   :     0   Min.   :-56.40751   Min.   :-72.71573   Min.   :-48.3256  
##  1st Qu.: 54202   1st Qu.: -0.92037   1st Qu.: -0.59855   1st Qu.: -0.8904  
##  Median : 84692   Median :  0.01811   Median :  0.06549   Median :  0.1799  
##  Mean   : 94814   Mean   :  0.00000   Mean   :  0.00000   Mean   :  0.0000  
##  3rd Qu.:139321   3rd Qu.:  1.31564   3rd Qu.:  0.80372   3rd Qu.:  1.0272  
##  Max.   :172792   Max.   :  2.45493   Max.   : 22.05773   Max.   :  9.3826  
##        V4                 V5                   V6                 V7          
##  Min.   :-5.68317   Min.   :-113.74331   Min.   :-26.1605   Min.   :-43.5572  
##  1st Qu.:-0.84864   1st Qu.:  -0.69160   1st Qu.: -0.7683   1st Qu.: -0.5541  
##  Median :-0.01985   Median :  -0.05434   Median : -0.2742   Median :  0.0401  
##  Mean   : 0.00000   Mean   :   0.00000   Mean   :  0.0000   Mean   :  0.0000  
##  3rd Qu.: 0.74334   3rd Qu.:   0.61193   3rd Qu.:  0.3986   3rd Qu.:  0.5704  
##  Max.   :16.87534   Max.   :  34.80167   Max.   : 73.3016   Max.   :120.5895  
##        V8                  V9                 V10                 V11          
##  Min.   :-73.21672   Min.   :-13.43407   Min.   :-24.58826   Min.   :-4.79747  
##  1st Qu.: -0.20863   1st Qu.: -0.64310   1st Qu.: -0.53543   1st Qu.:-0.76249  
##  Median :  0.02236   Median : -0.05143   Median : -0.09292   Median :-0.03276  
##  Mean   :  0.00000   Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.00000  
##  3rd Qu.:  0.32735   3rd Qu.:  0.59714   3rd Qu.:  0.45392   3rd Qu.: 0.73959  
##  Max.   : 20.00721   Max.   : 15.59500   Max.   : 23.74514   Max.   :12.01891  
##       V12                V13                V14                V15          
##  Min.   :-18.6837   Min.   :-5.79188   Min.   :-19.2143   Min.   :-4.49894  
##  1st Qu.: -0.4056   1st Qu.:-0.64854   1st Qu.: -0.4256   1st Qu.:-0.58288  
##  Median :  0.1400   Median :-0.01357   Median :  0.0506   Median : 0.04807  
##  Mean   :  0.0000   Mean   : 0.00000   Mean   :  0.0000   Mean   : 0.00000  
##  3rd Qu.:  0.6182   3rd Qu.: 0.66251   3rd Qu.:  0.4931   3rd Qu.: 0.64882  
##  Max.   :  7.8484   Max.   : 7.12688   Max.   : 10.5268   Max.   : 8.87774  
##       V16                 V17                 V18           
##  Min.   :-14.12985   Min.   :-25.16280   Min.   :-9.498746  
##  1st Qu.: -0.46804   1st Qu.: -0.48375   1st Qu.:-0.498850  
##  Median :  0.06641   Median : -0.06568   Median :-0.003636  
##  Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.000000  
##  3rd Qu.:  0.52330   3rd Qu.:  0.39968   3rd Qu.: 0.500807  
##  Max.   : 17.31511   Max.   :  9.25353   Max.   : 5.041069  
##       V19                 V20                 V21           
##  Min.   :-7.213527   Min.   :-54.49772   Min.   :-34.83038  
##  1st Qu.:-0.456299   1st Qu.: -0.21172   1st Qu.: -0.22839  
##  Median : 0.003735   Median : -0.06248   Median : -0.02945  
##  Mean   : 0.000000   Mean   :  0.00000   Mean   :  0.00000  
##  3rd Qu.: 0.458949   3rd Qu.:  0.13304   3rd Qu.:  0.18638  
##  Max.   : 5.591971   Max.   : 39.42090   Max.   : 27.20284  
##       V22                  V23                 V24          
##  Min.   :-10.933144   Min.   :-44.80774   Min.   :-2.83663  
##  1st Qu.: -0.542350   1st Qu.: -0.16185   1st Qu.:-0.35459  
##  Median :  0.006782   Median : -0.01119   Median : 0.04098  
##  Mean   :  0.000000   Mean   :  0.00000   Mean   : 0.00000  
##  3rd Qu.:  0.528554   3rd Qu.:  0.14764   3rd Qu.: 0.43953  
##  Max.   : 10.503090   Max.   : 22.52841   Max.   : 4.58455  
##       V25                 V26                V27            
##  Min.   :-10.29540   Min.   :-2.60455   Min.   :-22.565679  
##  1st Qu.: -0.31715   1st Qu.:-0.32698   1st Qu.: -0.070840  
##  Median :  0.01659   Median :-0.05214   Median :  0.001342  
##  Mean   :  0.00000   Mean   : 0.00000   Mean   :  0.000000  
##  3rd Qu.:  0.35072   3rd Qu.: 0.24095   3rd Qu.:  0.091045  
##  Max.   :  7.51959   Max.   : 3.51735   Max.   : 31.612198  
##       V28                Amount         Class     
##  Min.   :-15.43008   Min.   :    0.00   0:284315  
##  1st Qu.: -0.05296   1st Qu.:    5.60   1:   492  
##  Median :  0.01124   Median :   22.00             
##  Mean   :  0.00000   Mean   :   88.35             
##  3rd Qu.:  0.07828   3rd Qu.:   77.17             
##  Max.   : 33.84781   Max.   :25691.16&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looking at this summary, we do not see any critical issues such as missing values.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-partition&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data partition&lt;/h1&gt;
&lt;p&gt;Before applying any subsampling method, we first split the data into a training set and a testing set, and we subsample only the former.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;spsm(library(caret))
set.seed(1234)
index &amp;lt;- createDataPartition(data$Class, p = 0.8, list = FALSE)
train &amp;lt;- data[index, ]
test &amp;lt;- data[-index, ]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;subsampling-the-training-data&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Subsampling the training data&lt;/h1&gt;
&lt;div id=&#34;upsampling&#34; class=&#34;section level2&#34; number=&#34;3.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.1&lt;/span&gt; Upsampling&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;caret&lt;/strong&gt; package provides a function called &lt;strong&gt;upSample&lt;/strong&gt; to perform upsampling.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(111)
trainup &amp;lt;- upSample(x = train[, -ncol(train)], y = train$Class)
table(trainup$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##      0      1 
## 227452 227452&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, the two classes now have the same size: &lt;strong&gt;227452&lt;/strong&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;downsampling&#34; class=&#34;section level2&#34; number=&#34;3.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.2&lt;/span&gt; Downsampling&lt;/h2&gt;
&lt;p&gt;In the same way, we make use of the caret function &lt;strong&gt;downSample&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(111)
traindown &amp;lt;- downSample(x = train[, -ncol(train)], y = train$Class)
table(traindown$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##   0   1 
## 394 394&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now the size of each class is &lt;strong&gt;394&lt;/strong&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;rose&#34; class=&#34;section level2&#34; number=&#34;3.3&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.3&lt;/span&gt; ROSE&lt;/h2&gt;
&lt;p&gt;To use this technique we have to load the &lt;strong&gt;ROSE&lt;/strong&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;spsm(library(ROSE))
set.seed(111)
trainrose &amp;lt;- ROSE(Class ~ ., data = train)$data
table(trainrose$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##      0      1 
## 113827 114019&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since this technique creates new synthetic data points for the minority class and downsamples the majority class, the sizes are now about &lt;strong&gt;114019&lt;/strong&gt; for the minority class and &lt;strong&gt;113827&lt;/strong&gt; for the majority class.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;smote&#34; class=&#34;section level2&#34; number=&#34;3.4&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.4&lt;/span&gt; SMOTE&lt;/h2&gt;
&lt;p&gt;This technique requires the &lt;strong&gt;DMwR&lt;/strong&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;spsm(library(DMwR))
set.seed(111)
trainsmote &amp;lt;- SMOTE(Class ~ ., data = train)
table(trainsmote$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##    0    1 
## 1576 1182&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The majority class now has &lt;strong&gt;1576&lt;/strong&gt; cases and the minority class &lt;strong&gt;1182&lt;/strong&gt;, since SMOTE both downsamples the majority class and adds synthetic minority cases.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;training-logistic-regression-model.&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Training the logistic regression model&lt;/h1&gt;
&lt;p&gt;We are now ready to fit a logit model to the original training set without subsampling, and to each of the subsampled training sets obtained above.&lt;/p&gt;
&lt;div id=&#34;without-subsampling&#34; class=&#34;section level2&#34; number=&#34;4.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.1&lt;/span&gt; without subsampling&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
model &amp;lt;- glm(Class ~ ., data = train, family = &amp;quot;binomial&amp;quot;)
summary(model)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## glm(formula = Class ~ ., family = &amp;quot;binomial&amp;quot;, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.9290  -0.0291  -0.0190  -0.0124   4.6028  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(&amp;gt;|z|)    
## (Intercept) -8.486e+00  2.852e-01 -29.753  &amp;lt; 2e-16 ***
## Time        -2.673e-06  2.528e-06  -1.057  0.29037    
## V1           9.397e-02  4.794e-02   1.960  0.04996 *  
## V2           1.097e-02  6.706e-02   0.164  0.87006    
## V3           1.290e-03  5.949e-02   0.022  0.98270    
## V4           6.851e-01  8.408e-02   8.148 3.69e-16 ***
## V5           1.472e-01  7.301e-02   2.017  0.04372 *  
## V6          -8.450e-02  7.902e-02  -1.069  0.28491    
## V7          -1.098e-01  7.591e-02  -1.446  0.14816    
## V8          -1.718e-01  3.402e-02  -5.050 4.41e-07 ***
## V9          -1.926e-01  1.258e-01  -1.531  0.12579    
## V10         -8.073e-01  1.118e-01  -7.224 5.07e-13 ***
## V11         -3.920e-03  9.131e-02  -0.043  0.96575    
## V12          2.855e-02  9.432e-02   0.303  0.76210    
## V13         -3.064e-01  9.007e-02  -3.401  0.00067 ***
## V14         -5.308e-01  6.816e-02  -7.787 6.86e-15 ***
## V15         -1.285e-01  9.559e-02  -1.344  0.17903    
## V16         -2.164e-01  1.423e-01  -1.520  0.12840    
## V17          2.913e-02  7.729e-02   0.377  0.70624    
## V18         -3.642e-02  1.445e-01  -0.252  0.80095    
## V19          6.064e-02  1.094e-01   0.554  0.57938    
## V20         -4.449e-01  9.737e-02  -4.570 4.89e-06 ***
## V21          3.661e-01  6.709e-02   5.456 4.87e-08 ***
## V22          5.965e-01  1.519e-01   3.927 8.59e-05 ***
## V23         -1.157e-01  6.545e-02  -1.768  0.07706 .  
## V24          8.146e-02  1.625e-01   0.501  0.61622    
## V25          4.325e-02  1.482e-01   0.292  0.77043    
## V26         -2.679e-01  2.226e-01  -1.203  0.22893    
## V27         -7.280e-01  1.542e-01  -4.720 2.36e-06 ***
## V28         -2.817e-01  9.864e-02  -2.856  0.00429 ** 
## Amount       9.154e-04  4.379e-04   2.091  0.03656 *  
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5799.1  on 227845  degrees of freedom
## Residual deviance: 1768.0  on 227815  degrees of freedom
## AIC: 1830
## 
## Number of Fisher Scoring iterations: 12&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At this step, and to make things simpler, we remove the insignificant variables (those without an asterisk) and keep the remaining ones to use in all the following models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
model1 &amp;lt;- glm(Class ~ . - Time - V2 - V3 - V6 - V7 - V9 - V11 - V12 - V15 - V16 - 
    V17 - V18 - V19 - V24 - V25 - V26, data = train, family = &amp;quot;binomial&amp;quot;)
summary(model1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## glm(formula = Class ~ . - Time - V2 - V3 - V6 - V7 - V9 - V11 - 
##     V12 - V15 - V16 - V17 - V18 - V19 - V24 - V25 - V26, family = &amp;quot;binomial&amp;quot;, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6514  -0.0290  -0.0186  -0.0117   4.6192  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(&amp;gt;|z|)    
## (Intercept) -8.763e+00  1.510e-01 -58.023  &amp;lt; 2e-16 ***
## V1           2.108e-02  2.918e-02   0.722 0.470129    
## V4           7.241e-01  6.306e-02  11.483  &amp;lt; 2e-16 ***
## V5           9.934e-02  3.566e-02   2.785 0.005346 ** 
## V8          -1.549e-01  2.178e-02  -7.115 1.12e-12 ***
## V10         -9.290e-01  9.305e-02  -9.985  &amp;lt; 2e-16 ***
## V13         -3.307e-01  8.577e-02  -3.855 0.000116 ***
## V14         -5.229e-01  5.566e-02  -9.396  &amp;lt; 2e-16 ***
## V20         -2.388e-01  6.005e-02  -3.976 7.01e-05 ***
## V21          4.811e-01  5.259e-02   9.148  &amp;lt; 2e-16 ***
## V22          7.675e-01  1.277e-01   6.011 1.84e-09 ***
## V23         -1.522e-01  5.925e-02  -2.569 0.010212 *  
## V27         -6.381e-01  1.295e-01  -4.927 8.34e-07 ***
## V28         -2.485e-01  9.881e-02  -2.515 0.011900 *  
## Amount       2.713e-07  1.290e-04   0.002 0.998323    
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5799.1  on 227845  degrees of freedom
## Residual deviance: 1798.7  on 227831  degrees of freedom
## AIC: 1828.7
## 
## Number of Fisher Scoring iterations: 11&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We now have two predictors that are not significant, &lt;strong&gt;V1&lt;/strong&gt; and &lt;strong&gt;Amount&lt;/strong&gt;, so they should also be removed.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
finalmodel &amp;lt;- glm(Class ~ . - Time - V1 - V2 - V3 - V6 - V7 - V9 - V11 - V12 - V15 - 
    V16 - V17 - V18 - V19 - V24 - V25 - V26 - Amount, data = train, family = &amp;quot;binomial&amp;quot;)
summary(finalmodel)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## glm(formula = Class ~ . - Time - V1 - V2 - V3 - V6 - V7 - V9 - 
##     V11 - V12 - V15 - V16 - V17 - V18 - V19 - V24 - V25 - V26 - 
##     Amount, family = &amp;quot;binomial&amp;quot;, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6285  -0.0289  -0.0186  -0.0117   4.5835  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(&amp;gt;|z|)    
## (Intercept) -8.75058    0.14706 -59.505  &amp;lt; 2e-16 ***
## V4           0.69955    0.05265  13.288  &amp;lt; 2e-16 ***
## V5           0.10650    0.02586   4.119 3.81e-05 ***
## V8          -0.15525    0.01982  -7.833 4.76e-15 ***
## V10         -0.89573    0.07630 -11.740  &amp;lt; 2e-16 ***
## V13         -0.33583    0.08448  -3.975 7.02e-05 ***
## V14         -0.54238    0.04862 -11.155  &amp;lt; 2e-16 ***
## V20         -0.22318    0.04781  -4.668 3.04e-06 ***
## V21          0.47912    0.05205   9.204  &amp;lt; 2e-16 ***
## V22          0.78631    0.12439   6.321 2.60e-10 ***
## V23         -0.15046    0.05498  -2.736  0.00621 ** 
## V27         -0.58832    0.10411  -5.651 1.60e-08 ***
## V28         -0.23592    0.08901  -2.651  0.00804 ** 
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5799.1  on 227845  degrees of freedom
## Residual deviance: 1799.2  on 227833  degrees of freedom
## AIC: 1825.2
## 
## Number of Fisher Scoring iterations: 11&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the other training sets we will use only these significant predictors from the above model.&lt;/p&gt;
&lt;p&gt;Now let’s get the final results from the confusion matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(finalmodel, test, type = &amp;quot;response&amp;quot;)
pred &amp;lt;- as.integer(pred &amp;gt; 0.5)
confusionMatrix(as.factor(pred), test$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 56856    41
##          1     7    57
##                                           
##                Accuracy : 0.9992          
##                  95% CI : (0.9989, 0.9994)
##     No Information Rate : 0.9983          
##     P-Value [Acc &amp;gt; NIR] : 1.581e-08       
##                                           
##                   Kappa : 0.7033          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 1.906e-06       
##                                           
##             Sensitivity : 0.9999          
##             Specificity : 0.5816          
##          Pos Pred Value : 0.9993          
##          Neg Pred Value : 0.8906          
##              Prevalence : 0.9983          
##          Detection Rate : 0.9982          
##    Detection Prevalence : 0.9989          
##       Balanced Accuracy : 0.7908          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, we have a high accuracy rate of about &lt;strong&gt;99.92%&lt;/strong&gt;. However, this rate is almost the same as the no-information rate of &lt;strong&gt;99.83%&lt;/strong&gt; (what we would obtain by predicting every case as class label 0). In other words, this high rate is not due to the quality of the model but rather to the imbalanced classes.
If we look at the specificity rate, it is about &lt;strong&gt;58.16%&lt;/strong&gt;, indicating that the model poorly predicts the fraudulent cards, which are the most important class label to predict correctly.
Among the available metrics, the best one for imbalanced data is 
&lt;a href=&#34;https://towardsdatascience.com/interpretation-of-kappa-values-2acd1ca7b18f&#34;&gt;Cohen’s kappa&lt;/a&gt;; according to the interpretation scale suggested by Landis &amp;amp; Koch (1977), the kappa value obtained here, &lt;strong&gt;0.7033&lt;/strong&gt;, is a good score.&lt;/p&gt;
&lt;p&gt;Here, however, we stick with the accuracy rate for pedagogic purposes, to show the effectiveness of the methods discussed above.&lt;/p&gt;
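&lt;p&gt;The reported kappa can be recomputed by hand from the confusion-matrix counts above, as the observed agreement corrected for the agreement expected by chance; a quick sketch:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tab &amp;lt;- matrix(c(56856, 7, 41, 57), nrow = 2)  # rows = prediction, cols = reference
n &amp;lt;- sum(tab)
po &amp;lt;- sum(diag(tab))/n                      # observed agreement (the accuracy)
pe &amp;lt;- sum(rowSums(tab) * colSums(tab))/n^2  # agreement expected by chance
(po - pe)/(1 - pe)                          # about 0.7033, as reported&lt;/code&gt;&lt;/pre&gt;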
&lt;/div&gt;
&lt;div id=&#34;upsampling-the-train-set&#34; class=&#34;section level2&#34; number=&#34;4.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.2&lt;/span&gt; Upsampling the train set&lt;/h2&gt;
&lt;p&gt;Now let’s use the training data resulting from the upsampling method.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modelup &amp;lt;- glm(Class ~ V4 + V5 + V8 + V10 + V13 + V14 + V20 + V21 + V22 + V23 + V27 + 
    V28, data = trainup, family = &amp;quot;binomial&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(modelup)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## glm(formula = Class ~ V4 + V5 + V8 + V10 + V13 + V14 + V20 + 
##     V21 + V22 + V23 + V27 + V28, family = &amp;quot;binomial&amp;quot;, data = trainup)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.2906  -0.2785  -0.0001   0.0159   2.8055  
## 
## Coefficients:
##              Estimate Std. Error  z value Pr(&amp;gt;|z|)    
## (Intercept) -3.271053   0.011741 -278.610  &amp;lt; 2e-16 ***
## V4           0.952941   0.005478  173.966  &amp;lt; 2e-16 ***
## V5           0.126627   0.003976   31.846  &amp;lt; 2e-16 ***
## V8          -0.289448   0.004368  -66.261  &amp;lt; 2e-16 ***
## V10         -0.710629   0.009150  -77.665  &amp;lt; 2e-16 ***
## V13         -0.479344   0.007352  -65.200  &amp;lt; 2e-16 ***
## V14         -0.802941   0.006825 -117.638  &amp;lt; 2e-16 ***
## V20         -0.090453   0.007955  -11.371  &amp;lt; 2e-16 ***
## V21          0.233604   0.007702   30.332  &amp;lt; 2e-16 ***
## V22          0.209203   0.010125   20.662  &amp;lt; 2e-16 ***
## V23         -0.320073   0.005299  -60.399  &amp;lt; 2e-16 ***
## V27         -0.238132   0.017019  -13.992  &amp;lt; 2e-16 ***
## V28         -0.152294   0.019922   -7.644  2.1e-14 ***
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 630631  on 454903  degrees of freedom
## Residual deviance: 136321  on 454891  degrees of freedom
## AIC: 136347
## 
## Number of Fisher Scoring iterations: 9&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(modelup, test, type = &amp;quot;response&amp;quot;)
pred &amp;lt;- as.integer(pred &amp;gt; 0.5)
confusionMatrix(as.factor(pred), test$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 55334    12
##          1  1529    86
##                                           
##                Accuracy : 0.9729          
##                  95% CI : (0.9716, 0.9743)
##     No Information Rate : 0.9983          
##     P-Value [Acc &amp;gt; NIR] : 1               
##                                           
##                   Kappa : 0.0975          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : &amp;lt;2e-16          
##                                           
##             Sensitivity : 0.97311         
##             Specificity : 0.87755         
##          Pos Pred Value : 0.99978         
##          Neg Pred Value : 0.05325         
##              Prevalence : 0.99828         
##          Detection Rate : 0.97144         
##    Detection Prevalence : 0.97165         
##       Balanced Accuracy : 0.92533         
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we have a smaller accuracy rate, &lt;strong&gt;97.29%&lt;/strong&gt;, but a larger specificity rate, &lt;strong&gt;87.75%&lt;/strong&gt;, which increases the power of the model to detect fraudulent cards.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;down-sampling-the-training-set.&#34; class=&#34;section level2&#34; number=&#34;4.3&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.3&lt;/span&gt; Downsampling the training set&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modeldown &amp;lt;- glm(Class ~ V4 + V5 + V8 + V10 + V13 + V14 + V20 + V21 + V22 + V23 + 
    V27 + V28, data = traindown, family = &amp;quot;binomial&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(modeldown, test, type = &amp;quot;response&amp;quot;)
pred &amp;lt;- as.integer(pred &amp;gt; 0.5)
confusionMatrix(as.factor(pred), test$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 54837    12
##          1  2026    86
##                                           
##                Accuracy : 0.9642          
##                  95% CI : (0.9627, 0.9657)
##     No Information Rate : 0.9983          
##     P-Value [Acc &amp;gt; NIR] : 1               
##                                           
##                   Kappa : 0.0748          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : &amp;lt;2e-16          
##                                           
##             Sensitivity : 0.96437         
##             Specificity : 0.87755         
##          Pos Pred Value : 0.99978         
##          Neg Pred Value : 0.04072         
##              Prevalence : 0.99828         
##          Detection Rate : 0.96271         
##    Detection Prevalence : 0.96292         
##       Balanced Accuracy : 0.92096         
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the downsampling method we get approximately the same specificity rate, &lt;strong&gt;87.75%&lt;/strong&gt;, with a slight decrease in the overall accuracy rate, &lt;strong&gt;96.42%&lt;/strong&gt;. The sensitivity rate has also decreased, to &lt;strong&gt;96.43%&lt;/strong&gt;, since we have reduced the size of the majority class by downsampling.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;subsampline-the-train-set-by-rose-technique&#34; class=&#34;section level2&#34; number=&#34;4.4&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.4&lt;/span&gt; Subsampling the train set by the ROSE technique&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modelrose &amp;lt;- glm(Class ~ V4 + V5 + V8 + V10 + V13 + V14 + V20 + V21 + V22 + V23 + 
    V27 + V28, data = trainrose, family = &amp;quot;binomial&amp;quot;)
pred &amp;lt;- predict(modelrose, test, type = &amp;quot;response&amp;quot;)
pred &amp;lt;- as.integer(pred &amp;gt; 0.5)
confusionMatrix(as.factor(pred), test$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 56080    14
##          1   783    84
##                                         
##                Accuracy : 0.986         
##                  95% CI : (0.985, 0.987)
##     No Information Rate : 0.9983        
##     P-Value [Acc &amp;gt; NIR] : 1             
##                                         
##                   Kappa : 0.1715        
##                                         
##  Mcnemar&amp;#39;s Test P-Value : &amp;lt;2e-16        
##                                         
##             Sensitivity : 0.98623       
##             Specificity : 0.85714       
##          Pos Pred Value : 0.99975       
##          Neg Pred Value : 0.09689       
##              Prevalence : 0.99828       
##          Detection Rate : 0.98453       
##    Detection Prevalence : 0.98478       
##       Balanced Accuracy : 0.92169       
##                                         
##        &amp;#39;Positive&amp;#39; Class : 0             
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using this method the specificity rate, &lt;strong&gt;85.71%&lt;/strong&gt;, is slightly smaller than in the previous models, but it is still a large improvement in predicting fraudulent cards compared to the model trained on the original imbalanced data.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;subsampling-the-train-set-by-smote-technique&#34; class=&#34;section level2&#34; number=&#34;4.5&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.5&lt;/span&gt; Subsampling the train set by SMOTE technique&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modelsmote &amp;lt;- glm(Class ~ V4 + V5 + V8 + V10 + V13 + V14 + V20 + V21 + V22 + V23 + 
    V27 + V28, data = trainsmote, family = &amp;quot;binomial&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(modelsmote, test, type = &amp;quot;response&amp;quot;)
pred &amp;lt;- as.integer(pred &amp;gt; 0.5)
confusionMatrix(as.factor(pred), test$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 55457    14
##          1  1406    84
##                                           
##                Accuracy : 0.9751          
##                  95% CI : (0.9738, 0.9763)
##     No Information Rate : 0.9983          
##     P-Value [Acc &amp;gt; NIR] : 1               
##                                           
##                   Kappa : 0.1029          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : &amp;lt;2e-16          
##                                           
##             Sensitivity : 0.97527         
##             Specificity : 0.85714         
##          Pos Pred Value : 0.99975         
##          Neg Pred Value : 0.05638         
##              Prevalence : 0.99828         
##          Detection Rate : 0.97360         
##    Detection Prevalence : 0.97384         
##       Balanced Accuracy : 0.91621         
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this method we get the same specificity rate, &lt;strong&gt;85.71%&lt;/strong&gt;, as with the ROSE method.&lt;/p&gt;
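&lt;p&gt;To compare the four resampling approaches at a glance, we can collect the test-set rates reported in the confusion matrices above into a small data frame (the numbers are simply copied from the outputs above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results &amp;lt;- data.frame(
    method = c(&amp;quot;upsampling&amp;quot;, &amp;quot;downsampling&amp;quot;, &amp;quot;ROSE&amp;quot;, &amp;quot;SMOTE&amp;quot;),
    accuracy = c(0.9729, 0.9642, 0.986, 0.9751),
    specificity = c(0.87755, 0.87755, 0.85714, 0.85714))
results&lt;/code&gt;&lt;/pre&gt;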
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;deep-learning-model-without-class-weight.&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Deep learning model (without class weights)&lt;/h1&gt;
&lt;p&gt;Some deep learning software lets us assign a weight to each label of the target variable. Here we will make use of the &lt;a href=&#34;https://keras.rstudio.com&#34;&gt;keras&lt;/a&gt; package. We will first train the model without weighting the data, then we will retrain the same model after assigning a weight to the minority class.&lt;br /&gt;
To train this model we should first convert the data (train and test sets) into numeric matrices and remove the column names (we also convert &lt;strong&gt;Class&lt;/strong&gt; to numeric type). To stay in line with the models above we keep only the same features, but this time it is better to normalize them, since this helps gradient descent converge faster.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;spsm(library(keras))
train1 &amp;lt;- train[, c(&amp;quot;V4&amp;quot;, &amp;quot;V5&amp;quot;, &amp;quot;V8&amp;quot;, &amp;quot;V10&amp;quot;, &amp;quot;V13&amp;quot;, &amp;quot;V14&amp;quot;, &amp;quot;V20&amp;quot;, &amp;quot;V21&amp;quot;, &amp;quot;V22&amp;quot;, &amp;quot;V23&amp;quot;, 
    &amp;quot;V27&amp;quot;, &amp;quot;V28&amp;quot;, &amp;quot;Class&amp;quot;)]
test1 &amp;lt;- test[, c(&amp;quot;V4&amp;quot;, &amp;quot;V5&amp;quot;, &amp;quot;V8&amp;quot;, &amp;quot;V10&amp;quot;, &amp;quot;V13&amp;quot;, &amp;quot;V14&amp;quot;, &amp;quot;V20&amp;quot;, &amp;quot;V21&amp;quot;, &amp;quot;V22&amp;quot;, &amp;quot;V23&amp;quot;, 
    &amp;quot;V27&amp;quot;, &amp;quot;V28&amp;quot;, &amp;quot;Class&amp;quot;)]
train1$Class &amp;lt;- as.numeric(train1$Class)
test1$Class &amp;lt;- as.numeric(test1$Class)
train1[, &amp;quot;Class&amp;quot;] &amp;lt;- train1[, &amp;quot;Class&amp;quot;] - 1
test1[, &amp;quot;Class&amp;quot;] &amp;lt;- test1[, &amp;quot;Class&amp;quot;] - 1
trainx &amp;lt;- train1[, -ncol(train1)]
testx &amp;lt;- test1[, -ncol(test1)]
trained &amp;lt;- as.matrix(trainx)
tested &amp;lt;- as.matrix(testx)
trainy &amp;lt;- train1$Class
testy &amp;lt;- test1$Class
dimnames(trained) &amp;lt;- NULL
dimnames(tested) &amp;lt;- NULL&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we apply one-hot encoding to the target variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trainlabel &amp;lt;- to_categorical(trainy)
testlabel &amp;lt;- to_categorical(testy)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The final step is normalizing the matrices (trained and tested).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trained1 &amp;lt;- normalize(trained)
tested1 &amp;lt;- normalize(tested)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we are ready to create the model with two hidden layers followed by &lt;a href=&#34;https://keras.rstudio.com/reference/index.html#section-dropout-layers&#34;&gt;dropout layers&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modeldeep &amp;lt;- keras_model_sequential()
modeldeep %&amp;gt;% layer_dense(units = 32, activation = &amp;quot;relu&amp;quot;, kernel_initializer = &amp;quot;he_normal&amp;quot;, 
    input_shape = c(12)) %&amp;gt;% layer_dropout(rate = 0.2) %&amp;gt;% layer_dense(units = 64, 
    activation = &amp;quot;relu&amp;quot;, kernel_initializer = &amp;quot;he_normal&amp;quot;) %&amp;gt;% layer_dropout(rate = 0.4) %&amp;gt;% 
    layer_dense(units = 2, activation = &amp;quot;sigmoid&amp;quot;)
summary(modeldeep)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Model: &amp;quot;sequential&amp;quot;
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense (Dense)                       (None, 32)                      416         
## ________________________________________________________________________________
## dropout (Dropout)                   (None, 32)                      0           
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 64)                      2112        
## ________________________________________________________________________________
## dropout_1 (Dropout)                 (None, 64)                      0           
## ________________________________________________________________________________
## dense_2 (Dense)                     (None, 2)                       130         
## ================================================================================
## Total params: 2,658
## Trainable params: 2,658
## Non-trainable params: 0
## ________________________________________________________________________________&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will use the &lt;strong&gt;accuracy&lt;/strong&gt; rate as the metric. The loss function will be &lt;strong&gt;binary crossentropy&lt;/strong&gt;, since we are dealing with a binary classification problem, and for the optimizer we will use the 
&lt;a href=&#34;https://arxiv.org/pdf/1412.6980v8.pdf&#34;&gt;adam&lt;/a&gt; optimizer.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modeldeep %&amp;gt;% compile(loss = &amp;quot;binary_crossentropy&amp;quot;, optimizer = &amp;quot;adam&amp;quot;, metric = &amp;quot;accuracy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;During training, the model will use 10 epochs (the default), a batch size of 5 samples to update the weights, and hold out 20% of the training samples to assess the model.&lt;/p&gt;
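&lt;p&gt;The training call itself is not run when knitting this document; a minimal sketch of what it looks like, with the settings just described, is:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# modeldeep %&amp;gt;% fit(trained1, trainlabel, epochs = 10, batch_size = 5,
#     validation_split = 0.2)
# save_model_hdf5(modeldeep, &amp;quot;modeldeep.h5&amp;quot;)&lt;/code&gt;&lt;/pre&gt;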
&lt;p&gt;You can rerun this model many times until you are satisfied with the results; it is then better to save it and load it again each time you need it, as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modeldeep &amp;lt;- load_model_hdf5(&amp;quot;modeldeep.h5&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All the metric values above are computed during the training process, so they are not very reliable. The more reliable ones are those computed from unseen data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- modeldeep %&amp;gt;% predict_classes(tested1)
confusionMatrix(as.factor(pred), as.factor(testy))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 56858    64
##          1     5    34
##                                           
##                Accuracy : 0.9988          
##                  95% CI : (0.9985, 0.9991)
##     No Information Rate : 0.9983          
##     P-Value [Acc &amp;gt; NIR] : 0.00125         
##                                           
##                   Kappa : 0.4959          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 2.902e-12       
##                                           
##             Sensitivity : 0.9999          
##             Specificity : 0.3469          
##          Pos Pred Value : 0.9989          
##          Neg Pred Value : 0.8718          
##              Prevalence : 0.9983          
##          Detection Rate : 0.9982          
##    Detection Prevalence : 0.9993          
##       Balanced Accuracy : 0.6734          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As with the models above, the specificity rate suffers; at &lt;strong&gt;0.3469&lt;/strong&gt; it is even worse than in the other models, which is again caused by the imbalanced data.&lt;/p&gt;
&lt;div id=&#34;deep-learning-model-with-class-weights&#34; class=&#34;section level2&#34; number=&#34;5.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;5.1&lt;/span&gt; Deep learning model with class weights&lt;/h2&gt;
&lt;p&gt;Now let’s try the previous model, this time taking the class imbalance into account.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modeldeep1 &amp;lt;- keras_model_sequential()
modeldeep1 %&amp;gt;% layer_dense(units = 32, activation = &amp;quot;relu&amp;quot;, kernel_initializer = &amp;quot;he_normal&amp;quot;, 
    input_shape = c(12)) %&amp;gt;% layer_dropout(rate = 0.2) %&amp;gt;% layer_dense(units = 64, 
    activation = &amp;quot;relu&amp;quot;, kernel_initializer = &amp;quot;he_normal&amp;quot;) %&amp;gt;% layer_dropout(rate = 0.4) %&amp;gt;% 
    layer_dense(units = 2, activation = &amp;quot;sigmoid&amp;quot;)
modeldeep1 %&amp;gt;% compile(loss = &amp;quot;binary_crossentropy&amp;quot;, optimizer = &amp;quot;adam&amp;quot;, metric = &amp;quot;accuracy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To define the appropriate weight, we divide the fraction of the majority class by the fraction of the minority class to get how many times the former is larger than the latter.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prop.table(table(data$Class))[1]/prop.table(table(data$Class))[2]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       0 
## 577.876&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we include this value as weight in the &lt;strong&gt;class_weight&lt;/strong&gt; argument.&lt;/p&gt;
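&lt;p&gt;The weighted training call is likewise commented out here; a sketch, assuming the ratio computed above, is:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# modeldeep1 %&amp;gt;% fit(trained1, trainlabel, epochs = 10, batch_size = 5,
#     validation_split = 0.2, class_weight = list(&amp;quot;0&amp;quot; = 1, &amp;quot;1&amp;quot; = 577.876))
# save_model_hdf5(modeldeep1, &amp;quot;modeldeep1.h5&amp;quot;)&lt;/code&gt;&lt;/pre&gt;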
&lt;p&gt;Again, I saved this model before knitting the document. If you want to run the training code yourself, just uncomment it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modeldeep1 &amp;lt;- load_model_hdf5(&amp;quot;modeldeep1.h5&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s get the confusion matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- modeldeep1 %&amp;gt;% predict_classes(tested1)
confusionMatrix(as.factor(pred), as.factor(testy))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 55303    14
##          1  1560    84
##                                          
##                Accuracy : 0.9724         
##                  95% CI : (0.971, 0.9737)
##     No Information Rate : 0.9983         
##     P-Value [Acc &amp;gt; NIR] : 1              
##                                          
##                   Kappa : 0.0935         
##                                          
##  Mcnemar&amp;#39;s Test P-Value : &amp;lt;2e-16         
##                                          
##             Sensitivity : 0.97257        
##             Specificity : 0.85714        
##          Pos Pred Value : 0.99975        
##          Neg Pred Value : 0.05109        
##              Prevalence : 0.99828        
##          Detection Rate : 0.97089        
##    Detection Prevalence : 0.97114        
##       Balanced Accuracy : 0.91485        
##                                          
##        &amp;#39;Positive&amp;#39; Class : 0              
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using this model we get a lower accuracy rate, &lt;strong&gt;0.9724&lt;/strong&gt;, but the specificity rate is higher than in the previous model, so this model predicts the negative class label as well as the positive class label.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Conclusion&lt;/h1&gt;
&lt;p&gt;With imbalanced data, most machine learning models tend to predict the majority class more efficiently than the minority class. To correct this behavior we can use one of the methods discussed above to bring the per-class accuracy rates closer together. Deep learning models, however, can easily handle this problem by specifying class weights.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>An example preprint / working paper</title>
      <link>https://modelingwithr.rbind.io/publication/preprint/</link>
      <pubDate>Sun, 07 Apr 2019 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/publication/preprint/</guid>
      <description>&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click the &lt;em&gt;Slides&lt;/em&gt; button above to demo Academic&amp;rsquo;s Markdown slides feature.
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Supplementary notes can be added here, including 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/writing-markdown-latex/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;code and math&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Slides</title>
      <link>https://modelingwithr.rbind.io/slides/example/</link>
      <pubDate>Tue, 05 Feb 2019 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/slides/example/</guid>
      <description>&lt;h1 id=&#34;create-slides-in-markdown-with-academic&#34;&gt;Create slides in Markdown with Academic&lt;/h1&gt;
&lt;p&gt;
&lt;a href=&#34;https://sourcethemes.com/academic/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Academic&lt;/a&gt; | 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/managing-content/#create-slides&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Documentation&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;features&#34;&gt;Features&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Efficiently write slides in Markdown&lt;/li&gt;
&lt;li&gt;3-in-1: Create, Present, and Publish your slides&lt;/li&gt;
&lt;li&gt;Supports speaker notes&lt;/li&gt;
&lt;li&gt;Mobile friendly slides&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;controls&#34;&gt;Controls&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Next: &lt;code&gt;Right Arrow&lt;/code&gt; or &lt;code&gt;Space&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Previous: &lt;code&gt;Left Arrow&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Start: &lt;code&gt;Home&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Finish: &lt;code&gt;End&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Overview: &lt;code&gt;Esc&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Speaker notes: &lt;code&gt;S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Fullscreen: &lt;code&gt;F&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Zoom: &lt;code&gt;Alt + Click&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;https://github.com/hakimel/reveal.js#pdf-export&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;PDF Export&lt;/a&gt;: &lt;code&gt;E&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;code-highlighting&#34;&gt;Code Highlighting&lt;/h2&gt;
&lt;p&gt;Inline code: &lt;code&gt;variable&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Code block:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;porridge = &amp;quot;blueberry&amp;quot;
if porridge == &amp;quot;blueberry&amp;quot;:
    print(&amp;quot;Eating...&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2 id=&#34;math&#34;&gt;Math&lt;/h2&gt;
&lt;p&gt;In-line math: $x + y = z$&lt;/p&gt;
&lt;p&gt;Block math:&lt;/p&gt;
&lt;p&gt;$$
f\left( x \right) = \frac{{2\left( {x + 4} \right)\left( {x - 4} \right)}}{{\left( {x + 4} \right)\left( {x + 1} \right)}}
$$&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;fragments&#34;&gt;Fragments&lt;/h2&gt;
&lt;p&gt;Make content appear incrementally&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{{% fragment %}} One {{% /fragment %}}
{{% fragment %}} **Two** {{% /fragment %}}
{{% fragment %}} Three {{% /fragment %}}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Press &lt;code&gt;Space&lt;/code&gt; to play!&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
One
&lt;/span&gt;
&lt;span class=&#34;fragment &#34; &gt;
&lt;strong&gt;Two&lt;/strong&gt;
&lt;/span&gt;
&lt;span class=&#34;fragment &#34; &gt;
Three
&lt;/span&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;A fragment can accept two optional parameters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;class&lt;/code&gt;: use a custom style (requires definition in custom CSS)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;weight&lt;/code&gt;: sets the order in which a fragment appears&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;speaker-notes&#34;&gt;Speaker Notes&lt;/h2&gt;
&lt;p&gt;Add speaker notes to your presentation&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-markdown&#34;&gt;{{% speaker_note %}}
- Only the speaker can read these notes
- Press `S` key to view
{{% /speaker_note %}}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Press the &lt;code&gt;S&lt;/code&gt; key to view the speaker notes!&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;Only the speaker can read these notes&lt;/li&gt;
&lt;li&gt;Press &lt;code&gt;S&lt;/code&gt; key to view&lt;/li&gt;
&lt;/ul&gt;

&lt;/aside&gt;
&lt;hr&gt;
&lt;h2 id=&#34;themes&#34;&gt;Themes&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;black: Black background, white text, blue links (default)&lt;/li&gt;
&lt;li&gt;white: White background, black text, blue links&lt;/li&gt;
&lt;li&gt;league: Gray background, white text, blue links&lt;/li&gt;
&lt;li&gt;beige: Beige background, dark text, brown links&lt;/li&gt;
&lt;li&gt;sky: Blue background, thin dark text, blue links&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;ul&gt;
&lt;li&gt;night: Black background, thick white text, orange links&lt;/li&gt;
&lt;li&gt;serif: Cappuccino background, gray text, brown links&lt;/li&gt;
&lt;li&gt;simple: White background, black text, blue links&lt;/li&gt;
&lt;li&gt;solarized: Cream-colored background, dark green text, blue links&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/img/boards.jpg&#34;
  &gt;

&lt;h2 id=&#34;custom-slide&#34;&gt;Custom Slide&lt;/h2&gt;
&lt;p&gt;Customize the slide style and background&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-markdown&#34;&gt;{{&amp;lt; slide background-image=&amp;quot;/img/boards.jpg&amp;quot; &amp;gt;}}
{{&amp;lt; slide background-color=&amp;quot;#0000FF&amp;quot; &amp;gt;}}
{{&amp;lt; slide class=&amp;quot;my-style&amp;quot; &amp;gt;}}
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2 id=&#34;custom-css-example&#34;&gt;Custom CSS Example&lt;/h2&gt;
&lt;p&gt;Let&amp;rsquo;s make headers navy colored.&lt;/p&gt;
&lt;p&gt;Create &lt;code&gt;assets/css/reveal_custom.css&lt;/code&gt; with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-css&#34;&gt;.reveal section h1,
.reveal section h2,
.reveal section h3 {
  color: navy;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h1 id=&#34;questions&#34;&gt;Questions?&lt;/h1&gt;
&lt;p&gt;
&lt;a href=&#34;https://spectrum.chat/academic&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Ask&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;
&lt;a href=&#34;https://sourcethemes.com/academic/docs/managing-content/#create-slides&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Documentation&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Introduction to sparklyr</title>
      <link>https://modelingwithr.rbind.io/sparklyr/sparklyr/</link>
      <pubDate>Wed, 23 Jan 2019 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/sparklyr/sparklyr/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#installing-sparklyr&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Installing sparklyr&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#installing-spark&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Installing spark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#connecting-to-spark&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Connecting to spark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#importing-data&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Importing data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#manipulating-data&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Manipulating data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#disconnecting&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Disconnecting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#saving-data&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8&lt;/span&gt; saving data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#example-of-modeling-in-spark&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;9&lt;/span&gt; Example of modeling in spark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#streaming&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;10&lt;/span&gt; Streaming&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;The programming language R has very powerful tools and functions to do almost everything we want, such as wrangling, visualizing, modeling, etc. However, R, like all classical languages, requires the whole data set to be loaded into memory before doing anything, which is a big disadvantage when we deal with large data sets on a less powerful machine: even a small data manipulation becomes time consuming, and in some cases the data size can exceed the memory size so that R fails even to load the data.&lt;/p&gt;
&lt;p&gt;There are two widely used engines for this type of data, &lt;strong&gt;hadoop&lt;/strong&gt; and &lt;strong&gt;spark&lt;/strong&gt;, which both use a distributed system to partition the data across different storage locations and to distribute computation among different machines (computing clusters), or among different CPUs inside a single machine.&lt;/p&gt;
&lt;p&gt;Spark (2010) is more recent and is recognized to be faster than hadoop. &lt;strong&gt;scala&lt;/strong&gt; is its native language, but it can also support &lt;strong&gt;SQL&lt;/strong&gt; and &lt;strong&gt;java&lt;/strong&gt;. If you know neither spark nor hadoop, the obvious choice is spark. And if you are an R user who does not want to spend time learning the spark languages (scala or sql), the good news is that the &lt;strong&gt;sparklyr&lt;/strong&gt; package (or sparkR) is an R interface for spark, from which you can use most of your R code and functions from packages such as dplyr, etc.&lt;/p&gt;
&lt;p&gt;In this paper we will go step by step through how to use sparklyr, making use of some examples.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;installing-sparklyr&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Installing sparklyr&lt;/h1&gt;
&lt;p&gt;As with any R package, we call the function &lt;strong&gt;install.packages&lt;/strong&gt;
to install sparklyr, but before that make sure you have &lt;strong&gt;java&lt;/strong&gt; installed on your system, since the programming language &lt;strong&gt;scala&lt;/strong&gt; runs on the java virtual machine.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#install.packages(&amp;quot;sparklyr&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;installing-spark&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Installing spark&lt;/h1&gt;
&lt;p&gt;We have deliberately installed sparklyr before spark so that we can use its function &lt;strong&gt;spark_install()&lt;/strong&gt;, which downloads, installs, and configures the latest version of spark in one step.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#spark_install()&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;connecting-to-spark&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Connecting to spark&lt;/h1&gt;
&lt;p&gt;Usually, spark is designed to create clusters from multiple machines, either physical machines or virtual machines (in the cloud). However, it can also create a local cluster on a single machine, making use of its CPU cores, when available, to speed up data processing.&lt;/p&gt;
&lt;p&gt;Wherever the clusters are created (locally or in the cloud), the data processing functions work in the same way; the only difference is how we create and interact with these clusters. That being the case, we can get started with a local cluster to learn the most basic tasks of data science, such as importing, analyzing, and visualizing data, and fitting machine learning models, using spark via sparklyr.&lt;/p&gt;
&lt;p&gt;To connect to spark in local mode we use the function &lt;strong&gt;spark_connect&lt;/strong&gt; as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(sparklyr)
library(tidyverse)
sc&amp;lt;-spark_connect(master = &amp;quot;local&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;importing-data&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Importing data&lt;/h1&gt;
&lt;p&gt;If the data is built into R, we load it into the spark memory using the function &lt;strong&gt;copy_to&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-copy_to(sc,airquality)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then R can access this data with the help of sparklyr; for example, we can use the dplyr function &lt;strong&gt;glimpse&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;glimpse(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: ??
## Columns: 6
## Database: spark_connection
## $ Ozone   &amp;lt;int&amp;gt; 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, 18, 1...
## $ Solar_R &amp;lt;int&amp;gt; 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256, 290,...
## $ Wind    &amp;lt;dbl&amp;gt; 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9...
## $ Temp    &amp;lt;int&amp;gt; 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58,...
## $ Month   &amp;lt;int&amp;gt; 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,...
## $ Day     &amp;lt;int&amp;gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the data is stored outside R in some other format, sparklyr provides functions to import it. For example, to load a csv file we use the function &lt;strong&gt;spark_read_csv&lt;/strong&gt;, and for json we use &lt;strong&gt;spark_read_json&lt;/strong&gt;. To get the list of all the sparklyr functions and their usage, click 
&lt;a href=&#34;https://cran.r-project.org/web/packages/sparklyr/sparklyr.pdf&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For illustration we will load the &lt;strong&gt;creditcard&lt;/strong&gt; data stored on my machine as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;card&amp;lt;-spark_read_csv(sc,&amp;quot;creditcard.csv&amp;quot;)
sdf_dim(card)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 284807     31&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, using the same connection &lt;strong&gt;sc&lt;/strong&gt; we have loaded two data sets, &lt;strong&gt;mydata&lt;/strong&gt; and &lt;strong&gt;card&lt;/strong&gt;.&lt;/p&gt;
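&lt;p&gt;As a quick optional check (a sketch, assuming the connection &lt;strong&gt;sc&lt;/strong&gt; created above), we can list the tables currently registered in the spark connection with the dplyr function &lt;strong&gt;src_tbls&lt;/strong&gt;; it should report both &lt;strong&gt;airquality&lt;/strong&gt; and &lt;strong&gt;creditcard&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# list the tables cached in the spark connection sc
#src_tbls(sc)&lt;/code&gt;&lt;/pre&gt;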
&lt;p&gt;If we want to see what is going on in spark, we call the function &lt;strong&gt;spark_web()&lt;/strong&gt;, which takes us to the spark web interface.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#spark_web(sc)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;manipulating-data&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Manipulating data&lt;/h1&gt;
&lt;p&gt;With the help of sparklyr, we can very easily access the data in spark memory using dplyr functions. Let’s apply some manipulations to the data &lt;strong&gt;card&lt;/strong&gt;: for instance, filtering the data using the variable &lt;strong&gt;Time&lt;/strong&gt;, then computing the mean of &lt;strong&gt;Amount&lt;/strong&gt; for each class label of the variable &lt;strong&gt;Class&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;card %&amp;gt;%
  filter(Time &amp;lt;= mean(Time,na.rm = TRUE))%&amp;gt;%
      group_by(Class)%&amp;gt;%
  summarise(Class_avg=mean(Amount,na.rm=TRUE))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 2]
##   Class Class_avg
##   &amp;lt;int&amp;gt;     &amp;lt;dbl&amp;gt;
## 1     0      89.0
## 2     1     117.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, the output is a very small table, which can be moved from spark memory into R memory for further analysis using the function &lt;strong&gt;collect&lt;/strong&gt;. In other words, if you feel more at ease in R, then for any spark output that is small enough to be processed in R, add this function at the end of your pipeline before running it, to bring the output into R. For example, we cannot use the function &lt;strong&gt;plot&lt;/strong&gt; directly on the above table; we must first pull the output into R and then apply the function &lt;strong&gt;plot&lt;/strong&gt; as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;card %&amp;gt;%
  filter(Time &amp;lt;= mean(Time,na.rm = TRUE))%&amp;gt;%
      group_by(Class)%&amp;gt;%
  summarise(Class_avg=mean(Amount,na.rm=TRUE))%&amp;gt;%
  collect()%&amp;gt;%
  plot(col=&amp;quot;red&amp;quot;,pch=19,main = &amp;quot;Class average vs Class&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/sparklyr/sparklyr_files/figure-html/unnamed-chunk-10-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;However, we can plot sparklyr outputs without having to move them into R memory by using the &lt;strong&gt;dbplot&lt;/strong&gt; functions, since most of the functions of this package are supported by sparklyr. Let’s, for example, plot the mean of Amount by Class for the card transactions whose time is less than the mean.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dbplot)
card %&amp;gt;%
  filter(Time &amp;lt;= mean(Time,na.rm = TRUE))%&amp;gt;%
        dbplot_bar(Class,mean(Amount))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/sparklyr/sparklyr_files/figure-html/unnamed-chunk-11-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As we can see, the mean Amount of fraudulent cards is higher than that of regular cards.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;disconnecting&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Disconnecting&lt;/h1&gt;
&lt;p&gt;Each time you finish your work, remember to disconnect from spark to free up your resources, as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#spark_disconnect(sc)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;saving-data&#34; class=&#34;section level1&#34; number=&#34;8&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;8&lt;/span&gt; Saving data&lt;/h1&gt;
&lt;p&gt;Sparklyr provides functions to save files directly from spark memory to our working directory. For example, to save data as a csv file we use the function &lt;strong&gt;spark_write_csv&lt;/strong&gt; (we can also save in other formats using functions such as &lt;strong&gt;spark_write_parquet&lt;/strong&gt;) as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#spark_write_csv(card,&amp;quot;card.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;example-of-modeling-in-spark&#34; class=&#34;section level1&#34; number=&#34;9&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;9&lt;/span&gt; Example of modeling in spark&lt;/h1&gt;
&lt;p&gt;For machine learning models, spark has its own library, &lt;strong&gt;MLlib&lt;/strong&gt;, which has almost everything we need, so we do not need the &lt;strong&gt;caret&lt;/strong&gt; library.&lt;/p&gt;
&lt;p&gt;To illustrate how to fit a machine learning model, we train a logistic regression model to predict fraudulent cards from the data &lt;strong&gt;card&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;First, let’s split the data into a training set and a testing set; to do this we use the function &lt;strong&gt;sdf_random_split&lt;/strong&gt; as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;partitions&amp;lt;-card%&amp;gt;%
  sdf_random_split(training=0.8,test=0.2,seed = 123)
train&amp;lt;-partitions$training
test&amp;lt;-partitions$test&lt;/code&gt;&lt;/pre&gt;
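&lt;p&gt;As an optional sanity check (a sketch using the partitions defined above), we can count the rows of each partition with the sparklyr function &lt;strong&gt;sdf_nrow&lt;/strong&gt;; the two counts should be roughly 80% and 20% of the total number of rows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# row counts of the training and testing partitions
#sdf_nrow(train)
#sdf_nrow(test)&lt;/code&gt;&lt;/pre&gt;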
&lt;p&gt;Now we will use the &lt;strong&gt;train&lt;/strong&gt; set to train our model, and the &lt;strong&gt;test&lt;/strong&gt; set to assess the model’s performance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_in_spark&amp;lt;-train %&amp;gt;%
  ml_logistic_regression(Class~.)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can get the summary of this model by typing its name.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_in_spark&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Formula: Class ~ .
## 
## Coefficients:
##   (Intercept)          Time            V1            V2            V3 
## -8.305599e+00 -4.074154e-06  1.065118e-01  1.473891e-02 -8.426563e-03 
##            V4            V5            V6            V7            V8 
##  6.996793e-01  1.380980e-01 -1.217416e-01 -1.205822e-01 -1.700146e-01 
##            V9           V10           V11           V12           V13 
## -2.734966e-01 -8.277600e-01 -4.476393e-02  7.416858e-02 -2.828732e-01 
##           V14           V15           V16           V17           V18 
## -5.317753e-01 -1.221061e-01 -2.476344e-01 -1.591295e-03  3.403402e-02 
##           V19           V20           V21           V22           V23 
##  9.213132e-02 -4.914719e-01  3.863870e-01  6.407714e-01 -1.096256e-01 
##           V24           V25           V26           V27           V28 
##  1.366914e-01 -5.108841e-02  9.977837e-02 -8.384655e-01 -3.072630e-01 
##        Amount 
##  1.039041e-03&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fortunately, sparklyr also supports the functions of the &lt;strong&gt;broom&lt;/strong&gt; package, so we can get a nicer table using the function &lt;strong&gt;tidy&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(broom)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;broom&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tidy(model_in_spark)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 31 x 2
##    features    coefficients
##    &amp;lt;chr&amp;gt;              &amp;lt;dbl&amp;gt;
##  1 (Intercept)  -8.31      
##  2 Time         -0.00000407
##  3 V1            0.107     
##  4 V2            0.0147    
##  5 V3           -0.00843   
##  6 V4            0.700     
##  7 V5            0.138     
##  8 V6           -0.122     
##  9 V7           -0.121     
## 10 V8           -0.170     
## # ... with 21 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To evaluate the model performance, we use the function &lt;strong&gt;ml_evaluate&lt;/strong&gt; as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_summary&amp;lt;-ml_evaluate(model_in_spark,train)
model_summary&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## BinaryLogisticRegressionSummaryImpl 
##  Access the following via `$` or `ml_summary()`. 
##  - features_col() 
##  - label_col() 
##  - predictions() 
##  - probability_col() 
##  - area_under_roc() 
##  - f_measure_by_threshold() 
##  - pr() 
##  - precision_by_threshold() 
##  - recall_by_threshold() 
##  - roc() 
##  - prediction_col() 
##  - accuracy() 
##  - f_measure_by_label() 
##  - false_positive_rate_by_label() 
##  - labels() 
##  - precision_by_label() 
##  - recall_by_label() 
##  - true_positive_rate_by_label() 
##  - weighted_f_measure() 
##  - weighted_false_positive_rate() 
##  - weighted_precision() 
##  - weighted_recall() 
##  - weighted_true_positive_rate()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To extract a particular metric, we use &lt;strong&gt;$&lt;/strong&gt;. We can extract, for example, &lt;strong&gt;the accuracy rate&lt;/strong&gt;, the &lt;strong&gt;AUC&lt;/strong&gt;, or the &lt;strong&gt;roc&lt;/strong&gt; table.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_summary$area_under_roc()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.9765604&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_summary$accuracy()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.999149&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_summary$roc()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 2]
##        FPR   TPR
##      &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
##  1 0       0    
##  2 0.00849 0.876
##  3 0.0185  0.898
##  4 0.0285  0.908
##  5 0.0386  0.917
##  6 0.0487  0.922
##  7 0.0587  0.922
##  8 0.0688  0.925
##  9 0.0788  0.929
## 10 0.0888  0.934
## # ... with more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can retrieve this table into R to plot it with ggplot2, using the function &lt;strong&gt;collect&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_summary$roc()%&amp;gt;%
collect()%&amp;gt;%
ggplot(aes(FPR,TPR ))+
  geom_line(col=&amp;quot;blue&amp;quot;)+
  geom_abline(intercept = 0,slope = 1,col=&amp;quot;red&amp;quot;)+
  ggtitle(&amp;quot;the roc of model_in_spark &amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/sparklyr/sparklyr_files/figure-html/unnamed-chunk-20-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;A high accuracy rate on the training set may simply be the result of overfitting; the accuracy rate on the testing set is the more reliable one.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-ml_evaluate(model_in_spark,test)
pred$accuracy()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.9994722&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred$area_under_roc()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.9692241&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, to get the predictions we use the function &lt;strong&gt;ml_predict&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-ml_predict(model_in_spark,test)%&amp;gt;%
select(.,Class,prediction,probability_0,probability_1)
pred  &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 4]
##    Class prediction probability_0 probability_1
##    &amp;lt;int&amp;gt;      &amp;lt;dbl&amp;gt;         &amp;lt;dbl&amp;gt;         &amp;lt;dbl&amp;gt;
##  1     0          0         1.00       0.000221
##  2     0          0         1.00       0.000441
##  3     0          0         1.00       0.000184
##  4     0          0         1.00       0.000490
##  5     0          0         1.00       0.000199
##  6     0          0         0.999      0.000708
##  7     0          0         1.00       0.000231
##  8     0          0         0.999      0.000640
##  9     0          0         1.00       0.000265
## 10     0          0         0.999      0.000720
## # ... with more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, too, we can use the function &lt;strong&gt;collect&lt;/strong&gt; to plot the results.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred%&amp;gt;%
  collect()%&amp;gt;%
  ggplot(aes(Class,prediction ))+
  geom_point(size=0.1)+
  geom_jitter()+
  ggtitle(&amp;quot;Actual vs predicted&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/sparklyr/sparklyr_files/figure-html/unnamed-chunk-23-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;streaming&#34; class=&#34;section level1&#34; number=&#34;10&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;10&lt;/span&gt; Streaming&lt;/h1&gt;
&lt;p&gt;Among the most powerful properties of spark is that it can handle streaming data very easily. To show this, let’s use a simple example: we create a folder to contain the input of some data transformations and save the output in another folder, so that each time we add files to the first folder, the transformations are executed automatically and the output is saved in the second folder.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#dir.create(&amp;quot;raw_data&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once the folder is created, we split the data &lt;strong&gt;card&lt;/strong&gt; into two parts: the first part is exported now to the folder &lt;strong&gt;raw_data&lt;/strong&gt;, and then we apply some operations using the spark functions &lt;strong&gt;stream_read_csv&lt;/strong&gt; and &lt;strong&gt;stream_write_csv&lt;/strong&gt; as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#card1&amp;lt;-card%&amp;gt;%
  #filter(Time&amp;lt;=mean(Time,na.rm = TRUE))
#write.csv(card1,&amp;quot;raw_data/card1.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#stream &amp;lt;- stream_read_csv(sc,&amp;quot;raw_data/&amp;quot;)%&amp;gt;%
 # select(Class,Amount) %&amp;gt;%
#  stream_write_csv(&amp;quot;result/&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we add the second part to the folder raw_data, the streaming process launches to execute the above operations.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#card2&amp;lt;-card%&amp;gt;%
 # filter(Time&amp;gt;mean(Time,na.rm = TRUE))
#write.csv(card,&amp;quot;raw_data/card2.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#dir(&amp;quot;result&amp;quot;,pattern = &amp;quot;.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We stop the stream.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#stream_stop(stream)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sdf_describe(card)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 32]
##   summary Time  V1    V2    V3    V4    V5    V6    V7    V8    V9    V10  
##   &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;
## 1 count   2848~ 2848~ 2848~ 2848~ 2848~ 2848~ 2848~ 2848~ 2848~ 2848~ 2848~
## 2 mean    9481~ 1.75~ -8.2~ -9.6~ 8.32~ 1.64~ 4.24~ -3.0~ 8.81~ -1.1~ 7.09~
## 3 stddev  4748~ 1.95~ 1.65~ 1.51~ 1.41~ 1.38~ 1.33~ 1.23~ 1.19~ 1.09~ 1.08~
## 4 min     0     -56.~ -72.~ -48.~ -5.6~ -113~ -26.~ -43.~ -73.~ -13.~ -24.~
## 5 max     1727~ 2.45~ 22.0~ 9.38~ 16.8~ 34.8~ 73.3~ 120.~ 20.0~ 15.5~ 23.7~
## # ... with 20 more variables: V11 &amp;lt;chr&amp;gt;, V12 &amp;lt;chr&amp;gt;, V13 &amp;lt;chr&amp;gt;, V14 &amp;lt;chr&amp;gt;,
## #   V15 &amp;lt;chr&amp;gt;, V16 &amp;lt;chr&amp;gt;, V17 &amp;lt;chr&amp;gt;, V18 &amp;lt;chr&amp;gt;, V19 &amp;lt;chr&amp;gt;, V20 &amp;lt;chr&amp;gt;,
## #   V21 &amp;lt;chr&amp;gt;, V22 &amp;lt;chr&amp;gt;, V23 &amp;lt;chr&amp;gt;, V24 &amp;lt;chr&amp;gt;, V25 &amp;lt;chr&amp;gt;, V26 &amp;lt;chr&amp;gt;,
## #   V27 &amp;lt;chr&amp;gt;, V28 &amp;lt;chr&amp;gt;, Amount &amp;lt;chr&amp;gt;, Class &amp;lt;chr&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we disconnect&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;spark_disconnect(sc)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>An example journal article</title>
      <link>https://modelingwithr.rbind.io/publication/journal-article/</link>
      <pubDate>Tue, 01 Sep 2015 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/publication/journal-article/</guid>
      <description>&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click the &lt;em&gt;Cite&lt;/em&gt; button above to demo the feature to enable visitors to import publication metadata into their reference management software.
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click the &lt;em&gt;Slides&lt;/em&gt; button above to demo Academic&amp;rsquo;s Markdown slides feature.
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Supplementary notes can be added here, including 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/writing-markdown-latex/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;code and math&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>An example conference paper</title>
      <link>https://modelingwithr.rbind.io/publication/conference-paper/</link>
      <pubDate>Mon, 01 Jul 2013 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/publication/conference-paper/</guid>
      <description>&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click the &lt;em&gt;Cite&lt;/em&gt; button above to demo the feature to enable visitors to import publication metadata into their reference management software.
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click the &lt;em&gt;Slides&lt;/em&gt; button above to demo Academic&amp;rsquo;s Markdown slides feature.
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Supplementary notes can be added here, including 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/writing-markdown-latex/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;code and math&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
