<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Modeling with R</title>
    <link>https://modelingwithr.rbind.io/</link>
      <atom:link href="https://modelingwithr.rbind.io/index.xml" rel="self" type="application/rss+xml" />
    <description>Modeling with R</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Sat, 01 Jun 2030 13:00:00 +0000</lastBuildDate>
    <image>
      <url>https://modelingwithr.rbind.io/images/icon_hu0b7a4cb9992c9ac0e91bd28ffd38dd00_9727_512x512_fill_lanczos_center_2.png</url>
      <title>Modeling with R</title>
      <link>https://modelingwithr.rbind.io/</link>
    </image>
    
    <item>
      <title>Example Talk</title>
      <link>https://modelingwithr.rbind.io/talk/example/</link>
      <pubDate>Sat, 01 Jun 2030 13:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/talk/example/</guid>
      <description>&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click on the &lt;strong&gt;Slides&lt;/strong&gt; button above to view the built-in slides feature.
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Slides can be added in a few ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Create&lt;/strong&gt; slides using Academic&amp;rsquo;s 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/managing-content/#create-slides&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;em&gt;Slides&lt;/em&gt;&lt;/a&gt; feature and link using &lt;code&gt;slides&lt;/code&gt; parameter in the front matter of the talk file&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Upload&lt;/strong&gt; an existing slide deck to &lt;code&gt;static/&lt;/code&gt; and link using &lt;code&gt;url_slides&lt;/code&gt; parameter in the front matter of the talk file&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Embed&lt;/strong&gt; your slides (e.g. Google Slides) or presentation video on this page using 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/writing-markdown-latex/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;shortcodes&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Further talk details can easily be added to this page using &lt;em&gt;Markdown&lt;/em&gt; and $\rm \LaTeX$ math code.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Predicting large text data with spark via the R package sparklyr</title>
      <link>https://modelingwithr.rbind.io/sparklyr/text_spark/predicting-large-text-data-with-spark-via-the-r-package-sparklyr/</link>
      <pubDate>Thu, 02 Jul 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/sparklyr/text_spark/predicting-large-text-data-with-spark-via-the-r-package-sparklyr/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#abstract&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Abstract&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#keywords&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Keywords&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tf-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; TF model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tf-idf-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; TF-IDF model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#add-new-features&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Add new features&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#tf-model-1&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7.1&lt;/span&gt; TF model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#tf_idf-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7.2&lt;/span&gt; tf_idf model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#n-gram-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8&lt;/span&gt; n-gram model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;9&lt;/span&gt; Conclusion:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;10&lt;/span&gt; References&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-information&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;11&lt;/span&gt; Session information&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
.main-container {
  max-width: none;
  margin-left: 2.5em;
  margin-right: 2.5em;
}
&lt;/style&gt;
&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;abstract&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Abstract&lt;/h1&gt;
&lt;p&gt;Classical programming languages use only a single core, so they can be slow on very large data sets and may even fail to load them. Apache Spark, by contrast, is a very fast distributed system that handles large datasets with ease by deploying all the available machines and cores to build a cluster, so that the computing time of each task performed on the data is drastically reduced: each &lt;strong&gt;worker node&lt;/strong&gt; in the cluster takes charge of a small part of the task in question. Even though the native language of spark is &lt;strong&gt;scala&lt;/strong&gt; (it also supports &lt;strong&gt;java&lt;/strong&gt; and &lt;strong&gt;sql&lt;/strong&gt;), the good news for R users is that they can benefit from spark without having to learn those languages, by making use of the R package &lt;strong&gt;sparklyr&lt;/strong&gt;. In this article we train a random forest model on text data, which in practice is typically a large data set. For illustration purposes, however, and to make things faster, we use a small data set of email messages and constrain ourselves to the &lt;strong&gt;local mode&lt;/strong&gt;, in which spark creates a cluster from the cores available on a single machine. Notice that the same code can be used in the cloud whatever the size of the data, even with billions of data points; only the connection method to spark differs slightly. Since the raw data requires some transformation before being consumed by the model, we apply the well-known method called &lt;strong&gt;tokenization&lt;/strong&gt; to create the model features, then train and evaluate a random forest model on the design matrix after filling it using the &lt;strong&gt;TF&lt;/strong&gt; method. Lastly, we train the same model (a random forest with the same hyperparameter values) using another method, the &lt;strong&gt;TF-IDF&lt;/strong&gt; method (Sparck Jones, 1972).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;keywords&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Keywords&lt;/h1&gt;
&lt;p&gt;Large dataset, R, spark, sparklyr, cluster, tokenization, TF, TF-IDF, random forest model, machine learning.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;R is one of the best programming languages for statistical analysis, providing &lt;strong&gt;data scientists&lt;/strong&gt; with powerful tools that make their work easier and more exciting. However, since the amount of information today is growing exponentially, R and all the classical languages (python, java, etc.) that run on a single machine (a single core node) face great challenges in handling large datasets whose size can, in some cases, even exceed the memory size.
&lt;strong&gt;spark&lt;/strong&gt; and &lt;strong&gt;hadoop&lt;/strong&gt; are two systems designed to overcome these limitations of classical programming languages. Both use a distributed computing model that runs multiple tasks on multiple machines (called &lt;strong&gt;nodes&lt;/strong&gt;, and together called a &lt;strong&gt;cluster&lt;/strong&gt;) at the same time. However, spark has an edge over hadoop thanks to its ability to keep the data in memory, which makes it much faster (Luraschi, 2014).
Spark creates a cluster using either physical machines or virtual machines provided by a &lt;strong&gt;cloud&lt;/strong&gt; provider such as google, amazon, or microsoft (it can also create a cluster from the available cores of a single machine, known as &lt;strong&gt;local mode&lt;/strong&gt;). Its native language is scala, but it also supports sql and java. Thankfully, spark provides high-level APIs in &lt;strong&gt;python&lt;/strong&gt; and &lt;strong&gt;R&lt;/strong&gt;, so R users can use spark as a platform to work with large datasets with familiar code and without having to learn scala, sql, or java. The connection between R and spark is not direct, though; it is established with the help of the &lt;strong&gt;sparklyr&lt;/strong&gt; package, which, like any other R package, has its own functions and supports almost all the functions of the popular &lt;strong&gt;dplyr&lt;/strong&gt; R package.
Most text data are considered large datasets, either because of their size or because of the computing time required to manipulate or model them. That is why, in this paper, we will train a &lt;strong&gt;Random forest model&lt;/strong&gt; with sparklyr to predict whether a text message is spam or ham, using the data set &lt;strong&gt;SMSSpamCollection&lt;/strong&gt; downloaded from the &lt;strong&gt;kaggle&lt;/strong&gt; website. To convert the character features to numeric type we will use two well-known transformations: TF and TF-IDF (Sparck Jones, 1972).
This article is divided into the following sections:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data preparation: we illustrate how we read, clean, and prepare the data to be consumed by the model.&lt;/li&gt;
&lt;li&gt;TF model: we train a random forest model (James et al., 2013) on the term frequency (TF) features.&lt;/li&gt;
&lt;li&gt;TF-IDF model: we train the random forest model on the TF-IDF features.&lt;/li&gt;
&lt;li&gt;Add new features: we create another feature from the data to be used as a new predictor.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;First, we load the R packages &lt;strong&gt;tidyverse&lt;/strong&gt; and &lt;strong&gt;sparklyr&lt;/strong&gt;, and we set up the connection to spark using the following R code.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(sparklyr))
suppressPackageStartupMessages(library(tidyverse))
sc&amp;lt;-spark_connect(master = &amp;quot;local&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Second, we read the data, which has been downloaded and saved in the local R directory (notice that the data does not have column headers), and we display the first rows to get a first glance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;path &amp;lt;- &amp;quot;C://Users/dell/Documents/SMSSpamCollection.txt&amp;quot;
mydata&amp;lt;-spark_read_csv(sc,name=&amp;quot;SMS&amp;quot;,path=path, header=FALSE,delimiter = &amp;quot;\t&amp;quot;,overwrite = TRUE)
knitr::kable(head(mydata,3))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;96%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;V1&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;V2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;ham&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;ham&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Ok lar… Joking wif u oni…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;spam&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&amp;amp;C’s apply 08452810075over18’s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;It will be more practical to replace the default column names V1 and V2 with labels and messages respectively.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;names(mydata)&amp;lt;-c(&amp;quot;labels&amp;quot;,&amp;quot;messages&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can get the dimensions of this data by using the function &lt;strong&gt;sdf_dim&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sdf_dim(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 5574    2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also take a look at some messages by displaying the first three rows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;select(mydata,messages)%&amp;gt;%
  head(3) %&amp;gt;% 
  knitr::kable(&amp;quot;html&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
messages
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat…
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Ok lar… Joking wif u oni…
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&amp;amp;C’s apply 08452810075over18’s
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Modeling text data requires special attention since most machine learning algorithms require numeric inputs, so how can we transform the text entries in &lt;strong&gt;messages&lt;/strong&gt; into numeric type?
The best-known approach is &lt;strong&gt;tokenization&lt;/strong&gt;, which simply means splitting each text in the column &lt;strong&gt;messages&lt;/strong&gt; into small pieces called &lt;strong&gt;tokens&lt;/strong&gt; (together also called a bag of words) in such a way that each token contributes meaningfully to discriminating between the levels of the dependent variable &lt;strong&gt;labels&lt;/strong&gt;. For example, if we think that arbitrary numbers or some symbols like / or dots do not have any discriminating impact, we can remove them from the entries.
Each row in this data (labeled as ham or spam) is considered a &lt;strong&gt;document&lt;/strong&gt; (5574 documents in our case) that holds a text (a collection of tokens), and the whole collection of tokenized documents (as a rectangular matrix) is called a &lt;strong&gt;corpus&lt;/strong&gt;.
To keep things simple, let’s suppose that everything except letters is useless for predicting the labels, so we can use the Spark SQL function &lt;strong&gt;regexp_replace&lt;/strong&gt; to remove everything except letters; then we rename the resulting column &lt;strong&gt;cleaned&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;newdata&amp;lt;-mydata%&amp;gt;%
  mutate(cleaned=regexp_replace(messages,&amp;quot;[^a-zA-Z]&amp;quot;,&amp;quot; &amp;quot;))%&amp;gt;%
  mutate(cleaned=lower(cleaned))%&amp;gt;%
  select(labels,cleaned)
newdata%&amp;gt;%
  select(cleaned)%&amp;gt;%
  head(3)%&amp;gt;%
  knitr::kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;100%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;cleaned&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;ok lar joking wif u oni&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;free entry in a wkly comp to win fa cup final tkts st may text fa to to receive entry question std txt rate t c s apply over s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;At this stage, before going further, we should split the data into a training set and a testing set. However, since the data is imbalanced, with roughly 87% hams and 13% spams, we should preserve the proportions of the labels by splitting the data in such a way that we get stratified samples.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;newdata%&amp;gt;%
  group_by(labels)%&amp;gt;%
  count()%&amp;gt;%
  collect()%&amp;gt;%
  mutate(prop=n/sum(n))%&amp;gt;%
  knitr::kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;labels&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;n&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;prop&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;ham&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4827&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8659849&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;spam&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;747&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1340151&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;To accomplish this task by hand, we first filter the data into ham and spam subsets, then split each subset randomly into a training and a testing part, and finally bind the two training parts together into one training set and do the same for the testing parts.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dataham&amp;lt;-newdata%&amp;gt;%
  filter(labels==&amp;quot;ham&amp;quot;)
dataspam&amp;lt;-newdata%&amp;gt;%
  filter(labels==&amp;quot;spam&amp;quot;)
partitionham&amp;lt;-dataham%&amp;gt;%
  sdf_random_split(training=0.8,test=0.2,seed = 111)
partitionspam&amp;lt;-dataspam%&amp;gt;%
  sdf_random_split(training=0.8,test=0.2,seed = 111)

train&amp;lt;-sdf_bind_rows(partitionham$training,partitionspam$training)%&amp;gt;%
  compute(&amp;quot;train&amp;quot;)
test&amp;lt;-sdf_bind_rows(partitionham$test,partitionspam$test)%&amp;gt;%
  compute(&amp;quot;test&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;tf-model&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; TF model&lt;/h1&gt;
&lt;p&gt;Since machine learning models require numeric inputs, the common practice in text analysis is to convert each text into &lt;strong&gt;tokens&lt;/strong&gt; (pieces) so that these tokens become the features used to discriminate between the class labels; in our case the tokens are single words. With the &lt;strong&gt;TF&lt;/strong&gt; method, if a particular word occurs in a particular document we put its frequency (or just 1 if we do not care about the frequency) in the corresponding cell of the design matrix (called the Document Term Matrix, &lt;strong&gt;DTM&lt;/strong&gt;); otherwise we put zero.
This method gives a very large and sparse rectangular matrix with a huge number of features compared to the number of documents, which is exactly the type of data spark can help handle.
Due to its popularity, we will fit a random forest model, known as one of the most powerful machine learning models, to the transformed data. For brevity we will make use of the spark &lt;strong&gt;pipeline&lt;/strong&gt; feature, which lets us group all the steps required to run the model:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;convert the dependent variable labels to integer type.&lt;/li&gt;
&lt;li&gt;tokenize the cleaned messages into words (tokens).&lt;/li&gt;
&lt;li&gt;remove stop words from the tokens since they tend to spread out randomly among documents.&lt;/li&gt;
&lt;li&gt;replace each term in each document by its frequency number.&lt;/li&gt;
&lt;li&gt;define the model that will be used (here random forest model).&lt;/li&gt;
&lt;/ul&gt;
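&lt;p&gt;As a toy illustration of the tokenization and term-frequency steps above, here is a small sketch in plain R (no spark needed), using two assumed documents:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;docs &amp;lt;- c(&amp;quot;free entry win free&amp;quot;, &amp;quot;ok see you later&amp;quot;)
tokens &amp;lt;- strsplit(docs, &amp;quot; &amp;quot;)          # tokenize on spaces
vocab &amp;lt;- sort(unique(unlist(tokens)))   # vocabulary over the corpus
# document-term matrix: one row per document, one column per token
dtm &amp;lt;- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
dtm&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      entry free later ok see win you
## [1,]     1    2     0  0   0   1   0
## [2,]     0    0     1  1   1   0   1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;strong&gt;ft_count_vectorizer&lt;/strong&gt; step builds essentially this matrix, stored in a sparse format, over the whole corpus.&lt;/p&gt;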
&lt;p&gt;At the final step we use the &lt;strong&gt;ml_random_forest_classifier&lt;/strong&gt; function and keep all the default values, for example 20 trees, a maximum depth of 5, and &lt;strong&gt;gini&lt;/strong&gt; as the impurity function; do not forget to set the seed to make the results reproducible. Lastly, we call the &lt;strong&gt;ml_fit&lt;/strong&gt; function to fit the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pipline&amp;lt;-ml_pipeline(sc)%&amp;gt;%
  ft_string_indexer(input_col = &amp;quot;labels&amp;quot;,output_col=&amp;quot;class&amp;quot;)%&amp;gt;%
  ft_tokenizer(input_col = &amp;quot;cleaned&amp;quot;, output_col=&amp;quot;words&amp;quot;)%&amp;gt;%
  ft_stop_words_remover(input_col = &amp;quot;words&amp;quot;,output_col = &amp;quot;cleaned_words&amp;quot;)%&amp;gt;%
  ft_count_vectorizer(input_col = &amp;quot;cleaned_words&amp;quot;, output_col=&amp;quot;terms&amp;quot;,
                      min_df=5,binary=TRUE)%&amp;gt;%
  ft_vector_assembler(input_cols = &amp;quot;terms&amp;quot;,output_col=&amp;quot;features&amp;quot;)%&amp;gt;%
  ml_random_forest_classifier(label_col=&amp;quot;class&amp;quot;,
                 features_col=&amp;quot;features&amp;quot;,
                 seed=222)
model_rf&amp;lt;-ml_fit(pipline,train)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To evaluate our model we use the &lt;strong&gt;ml_transform&lt;/strong&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ml_transform(model_rf,train)%&amp;gt;%
  ml_binary_classification_evaluator(label_col = &amp;quot;class&amp;quot;,
                                     metric_name= &amp;quot;areaUnderROC&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 0.9693865&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that for binary classification models sparklyr provides only two metrics, &lt;strong&gt;areaUnderROC&lt;/strong&gt; and &lt;strong&gt;areaUnderPR&lt;/strong&gt; (Murphy, 2012). Using the former metric we get a high score of about 0.969.
This metric ranges between 0 and 1; the higher it is, the better the model. However, since this value comes from the training data, it might be the result of overfitting (Lantz, 2016), which is why the more reliable estimate is the one computed on the testing set, which is about 0.965.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ml_transform(model_rf,test)%&amp;gt;%
  ml_binary_classification_evaluator(label_col = &amp;quot;class&amp;quot;,
                                     metric_name= &amp;quot;areaUnderROC&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 0.9653819&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fortunately, the two values are very close to each other, indicating that our model generalizes well.&lt;br /&gt;
To get the predictions we use the &lt;strong&gt;ml_predict&lt;/strong&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-ml_predict(model_rf,test)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, some columns are nested. This is not a problem, since we can extract the elements of such a list column using the function &lt;strong&gt;unlist&lt;/strong&gt;. For instance, we can show the most used words in each class label using the package &lt;strong&gt;wordcloud&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p1&amp;lt;-pred%&amp;gt;%
  filter(labels==&amp;quot;ham&amp;quot;)%&amp;gt;%
  pull(cleaned_words)%&amp;gt;%
  unlist()
wordcloud::wordcloud(p1,max.words = 50, random.order = FALSE,
                     colors=c(&amp;quot;blue&amp;quot;,&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;,&amp;quot;yellow&amp;quot;),random.color = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/sparklyr/text_spark/2020-07-02-predicting-large-text-data-with-spark-via-the-r-package-sparklyr.en_files/figure-html/unnamed-chunk-14-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p2&amp;lt;-pred%&amp;gt;%
  filter(labels==&amp;quot;spam&amp;quot;)%&amp;gt;%
  pull(cleaned_words)%&amp;gt;%
  unlist()
wordcloud::wordcloud(p2,max.words = 50,random.order = FALSE, 
                     colors=c(&amp;quot;blue&amp;quot;,&amp;quot;red&amp;quot;,&amp;quot;green&amp;quot;,&amp;quot;yellow&amp;quot;),random.color = TRUE)  &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/sparklyr/text_spark/2020-07-02-predicting-large-text-data-with-spark-via-the-r-package-sparklyr.en_files/figure-html/unnamed-chunk-14-2.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The upper figure shows that the most common words in hams are: get, good, know, whereas the lower figure shows the most common ones for spams: call, free, mobile. This means that a new email message containing the word free, for instance, is more likely to be spam.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;tf-idf-model&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; TF-IDF model&lt;/h1&gt;
&lt;p&gt;The main drawback of the TF method is that it does not take into account the distribution of each term across the documents, which reflects how much information each term provides. To measure the information of a term &lt;strong&gt;t&lt;/strong&gt; we compute its &lt;strong&gt;DF&lt;/strong&gt; (document frequency) value, the number of documents &lt;strong&gt;d&lt;/strong&gt; in which &lt;strong&gt;t&lt;/strong&gt; appears, and from it the inverse document frequency &lt;strong&gt;IDF&lt;/strong&gt; value is computed as follows:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[idf(t,D)=\log\left(\frac{N}{1+|\{d\in D : t\in d\}|}\right)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;where N is the total number of documents (number of rows).
Multiplying TF by IDF gives the TF-IDF value of each term. We add the function &lt;strong&gt;ft_idf&lt;/strong&gt; to the previous TF pipeline, fit the random forest model again on the transformed data, and evaluate the model directly on the test data.&lt;/p&gt;
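&lt;p&gt;As a quick numeric sketch of the formula in plain R (with assumed values: 1000 documents in total, 99 of them containing the term; spark’s own &lt;strong&gt;ft_idf&lt;/strong&gt; implementation may apply a slightly different smoothing):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;N &amp;lt;- 1000   # assumed total number of documents
df &amp;lt;- 99    # assumed number of documents containing the term
log(N/(1 + df))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 2.302585&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A rare term gets a large idf and thus a heavier weight, while a term appearing in almost every document gets an idf close to zero.&lt;/p&gt;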
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pipline2&amp;lt;-ml_pipeline(sc)%&amp;gt;%
  ft_string_indexer(input_col = &amp;quot;labels&amp;quot;,output_col=&amp;quot;class&amp;quot;)%&amp;gt;%
  ft_tokenizer(input_col = &amp;quot;cleaned&amp;quot;, output_col=&amp;quot;words&amp;quot;)%&amp;gt;%
  ft_stop_words_remover(input_col = &amp;quot;words&amp;quot;,output_col = &amp;quot;cleaned_words&amp;quot;)%&amp;gt;%
  ft_count_vectorizer(input_col = &amp;quot;cleaned_words&amp;quot;, output_col=&amp;quot;tf_terms&amp;quot;)%&amp;gt;%
  ft_idf(input_col = &amp;quot;tf_terms&amp;quot;, output_col=&amp;quot;tfidf_terms&amp;quot;)%&amp;gt;%
    ml_random_forest_classifier(label_col=&amp;quot;class&amp;quot;,
                 features_col=&amp;quot;tfidf_terms&amp;quot;,
                 seed=222)

model_rf.tfidf &amp;lt;- ml_fit(pipline2, train)

ml_transform(model_rf.tfidf,test)%&amp;gt;%
  ml_binary_classification_evaluator(label_col = &amp;quot;class&amp;quot;,
                                     metric_name= &amp;quot;areaUnderROC&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.953212&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using this model, which is more complex than the previous one, is not justified for this data, since the two scores are close to each other.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;add-new-features&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Add new features&lt;/h1&gt;
&lt;p&gt;Engineering new features that we believe are more relevant than the existing ones is a popular strategy for improving prediction quality. For example, since we think that spam messages tend to be shorter than ham messages, we can add the messages’ lengths as new features.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train1 &amp;lt;- train %&amp;gt;% mutate(lengths=nchar(cleaned))
test1 &amp;lt;- test %&amp;gt;% mutate(lengths=nchar(cleaned))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s retrain the above models with this newly added feature.&lt;/p&gt;
&lt;div id=&#34;tf-model-1&#34; class=&#34;section level2&#34; number=&#34;7.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;7.1&lt;/span&gt; TF model&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pipline_tf&amp;lt;-ml_pipeline(sc)%&amp;gt;%
  ft_string_indexer(input_col = &amp;quot;labels&amp;quot;,output_col=&amp;quot;class&amp;quot;)%&amp;gt;%
  ft_tokenizer(input_col = &amp;quot;cleaned&amp;quot;, output_col=&amp;quot;words&amp;quot;)%&amp;gt;%
  ft_stop_words_remover(input_col = &amp;quot;words&amp;quot;,output_col = &amp;quot;cleaned_words&amp;quot;)%&amp;gt;%
  ft_count_vectorizer(input_col = &amp;quot;cleaned_words&amp;quot;, output_col=&amp;quot;terms&amp;quot;,
                      min_df=5,binary=TRUE)%&amp;gt;%
  ft_vector_assembler(input_cols = c(&amp;quot;terms&amp;quot;,&amp;quot;lengths&amp;quot;),output_col=&amp;quot;features&amp;quot;)%&amp;gt;%
  ml_random_forest_classifier(label_col=&amp;quot;class&amp;quot;,
                 features_col=&amp;quot;features&amp;quot;,
                 seed=222)

model_rf_new&amp;lt;-ml_fit(pipline_tf,train1)
ml_transform(model_rf_new,test1)%&amp;gt;%
  ml_binary_classification_evaluator(label_col = &amp;quot;class&amp;quot;,
                                     metric_name= &amp;quot;areaUnderROC&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.9849365&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fortunately, our expectation about this new feature is confirmed, as we obtain a significant improvement over the previous results.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;tf_idf-model&#34; class=&#34;section level2&#34; number=&#34;7.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;7.2&lt;/span&gt; tf_idf model&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pipline_tfidf&amp;lt;-ml_pipeline(sc)%&amp;gt;%
  ft_string_indexer(input_col = &amp;quot;labels&amp;quot;,output_col=&amp;quot;class&amp;quot;)%&amp;gt;%
  ft_tokenizer(input_col = &amp;quot;cleaned&amp;quot;, output_col=&amp;quot;words&amp;quot;)%&amp;gt;%
  ft_stop_words_remover(input_col = &amp;quot;words&amp;quot;,output_col = &amp;quot;cleaned_words&amp;quot;)%&amp;gt;%
  ft_count_vectorizer(input_col = &amp;quot;cleaned_words&amp;quot;, output_col=&amp;quot;tf_terms&amp;quot;)%&amp;gt;%
  ft_idf(input_col = &amp;quot;tf_terms&amp;quot;, output_col=&amp;quot;tfidf_terms&amp;quot;)%&amp;gt;%
  ft_vector_assembler(input_cols = c(&amp;quot;tfidf_terms&amp;quot;,&amp;quot;lengths&amp;quot;),output_col=&amp;quot;features&amp;quot;)%&amp;gt;%
    ml_random_forest_classifier(label_col=&amp;quot;class&amp;quot;,
                 features_col=&amp;quot;features&amp;quot;,
                 seed=222)

model_rf_new2 &amp;lt;- ml_fit(pipline_tfidf, train1)

ml_transform(model_rf_new2,test1)%&amp;gt;%
  ml_binary_classification_evaluator(label_col = &amp;quot;class&amp;quot;,
                                     metric_name= &amp;quot;areaUnderROC&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.9857918&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, as noted before, the idf weighting is not justified here, and it would be better to stay with the tf method.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;n-gram-model&#34; class=&#34;section level1&#34; number=&#34;8&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;8&lt;/span&gt; n-gram model&lt;/h1&gt;
&lt;p&gt;In contrast to the function &lt;strong&gt;ft_tokenizer&lt;/strong&gt;, which splits the text into tokens of a single word each, the sparklyr function &lt;strong&gt;ft_ngram&lt;/strong&gt; produces tokens of n words each, preserving the order in which the words appear in the original text.
To understand this better, let’s take the following example.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data &amp;lt;- copy_to(sc, data.frame(x=&amp;quot;I like both R and python&amp;quot;), overwrite = TRUE)
data&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 1]
##   x                       
##   &amp;lt;chr&amp;gt;                   
## 1 I like both R and python&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;strong&gt;ft_tokenizer&lt;/strong&gt; function gives the following tokens:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ft_tokenizer(data, &amp;quot;x&amp;quot;, &amp;quot;y&amp;quot;) %&amp;gt;% 
  mutate(y1=explode(y)) %&amp;gt;% select(y1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 1]
##   y1    
##   &amp;lt;chr&amp;gt; 
## 1 i     
## 2 like  
## 3 both  
## 4 r     
## 5 and   
## 6 python&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With &lt;strong&gt;ft_ngram&lt;/strong&gt; and &lt;span class=&#34;math inline&#34;&gt;\(n=2\)&lt;/span&gt;, we get the following tokens:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data  %&amp;gt;%  ft_tokenizer(&amp;quot;x&amp;quot;, &amp;quot;y&amp;quot;) %&amp;gt;% 
  ft_ngram(&amp;quot;y&amp;quot;, &amp;quot;y1&amp;quot;, n=2) %&amp;gt;%
  mutate(z=explode(y1)) %&amp;gt;% 
  select(z)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 1]
##   z         
##   &amp;lt;chr&amp;gt;     
## 1 i like    
## 2 like both 
## 3 both r    
## 4 r and     
## 5 and python&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s train a 2-gram random forest model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pipline_2gram&amp;lt;-ml_pipeline(sc)%&amp;gt;%
  ft_string_indexer(input_col = &amp;quot;labels&amp;quot;,output_col=&amp;quot;class&amp;quot;)%&amp;gt;%
  ft_tokenizer(input_col = &amp;quot;cleaned&amp;quot;, output_col=&amp;quot;words&amp;quot;)%&amp;gt;%
  ft_stop_words_remover(input_col = &amp;quot;words&amp;quot;,output_col = &amp;quot;cleaned_words&amp;quot;)%&amp;gt;%
  ft_ngram(input_col = &amp;quot;cleaned_words&amp;quot;, output_col=&amp;quot;ngram_words&amp;quot;, n=2) %&amp;gt;% 
  ft_count_vectorizer(input_col = &amp;quot;ngram_words&amp;quot;, output_col=&amp;quot;tf_terms&amp;quot;)%&amp;gt;%
  ft_vector_assembler(input_cols = c(&amp;quot;tf_terms&amp;quot;,&amp;quot;lengths&amp;quot;),output_col=&amp;quot;features&amp;quot;)%&amp;gt;%
  ml_random_forest_classifier(label_col=&amp;quot;class&amp;quot;,
                 features_col=&amp;quot;features&amp;quot;,
                 seed=222)

model_rf_2gram &amp;lt;- ml_fit(pipline_2gram, train1)

ml_transform(model_rf_2gram,test1)%&amp;gt;%
  ml_binary_classification_evaluator(label_col = &amp;quot;class&amp;quot;,
                                     metric_name= &amp;quot;areaUnderROC&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.8835537&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that this function produces only tokens with exactly two words, not tokens with at most two words, so the single-word terms are discarded. That is why we obtained a lower score than the previous models.&lt;/p&gt;
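&lt;p&gt;If we wanted to keep both the single words and the word pairs, one possible variation of the above pipeline (a sketch, which we have not run here) is to vectorize the unigrams and the bigrams separately and assemble both term-frequency columns into the feature vector:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pipline_12gram &amp;lt;- ml_pipeline(sc) %&amp;gt;%
  ft_string_indexer(input_col = &amp;quot;labels&amp;quot;, output_col = &amp;quot;class&amp;quot;) %&amp;gt;%
  ft_tokenizer(input_col = &amp;quot;cleaned&amp;quot;, output_col = &amp;quot;words&amp;quot;) %&amp;gt;%
  ft_stop_words_remover(input_col = &amp;quot;words&amp;quot;, output_col = &amp;quot;cleaned_words&amp;quot;) %&amp;gt;%
  ft_ngram(input_col = &amp;quot;cleaned_words&amp;quot;, output_col = &amp;quot;ngram_words&amp;quot;, n = 2) %&amp;gt;%
  # one count vectorizer for the unigrams, another for the bigrams
  ft_count_vectorizer(input_col = &amp;quot;cleaned_words&amp;quot;, output_col = &amp;quot;tf_words&amp;quot;) %&amp;gt;%
  ft_count_vectorizer(input_col = &amp;quot;ngram_words&amp;quot;, output_col = &amp;quot;tf_ngrams&amp;quot;) %&amp;gt;%
  ft_vector_assembler(input_cols = c(&amp;quot;tf_words&amp;quot;, &amp;quot;tf_ngrams&amp;quot;, &amp;quot;lengths&amp;quot;),
                      output_col = &amp;quot;features&amp;quot;) %&amp;gt;%
  ml_random_forest_classifier(label_col = &amp;quot;class&amp;quot;,
                              features_col = &amp;quot;features&amp;quot;,
                              seed = 222)&lt;/code&gt;&lt;/pre&gt;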
&lt;p&gt;When you are satisfied with your final model, you can save it for further use as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#ml_save(model_rf_2gram, &amp;quot;spark_ngram&amp;quot;, overwrite = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One last thing: when you finish your work, do not forget to free your resources by disconnecting from Spark as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;spark_disconnect(sc)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;9&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;9&lt;/span&gt; Conclusion:&lt;/h1&gt;
&lt;p&gt;This article is a brief introduction illustrating how easy it is to handle and model large datasets by combining the two powerful tools R and Spark. We used a text dataset because this type of data characterizes many of the large datasets encountered in the real world.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34; number=&#34;10&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;10&lt;/span&gt; References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Brett Lantz (2016). Machine Learning with R. Packt Publishing. Second edition. ISBN 978-1-78439-390-8.&lt;/li&gt;
&lt;li&gt;Gareth James et al. (2013). An Introduction to Statistical Learning. Springer. ISBN 978-1-4614-7138-7.&lt;/li&gt;
&lt;li&gt;Javier Luraschi (2019). Mastering Spark with R. O’Reilly. &lt;a href=&#34;https://therinspark.com/intro.html&#34; class=&#34;uri&#34;&gt;https://therinspark.com/intro.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Kevin P. Murphy (2012). Machine Learning: A Probabilistic Perspective. The MIT Press. ISBN 978-0-262-01802-9.&lt;/li&gt;
&lt;li&gt;Spärck Jones, K. (1972). A Statistical Interpretation of Term Specificity and Its Application in Retrieval. Journal of Documentation, 28: 11–21.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.kaggle.com/team-ai/spam-text-message-classification&#34; class=&#34;uri&#34;&gt;https://www.kaggle.com/team-ai/spam-text-message-classification&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.tidyverse.org/packages/&#34; class=&#34;uri&#34;&gt;https://www.tidyverse.org/packages/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf&#34; class=&#34;uri&#34;&gt;https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;session-information&#34; class=&#34;section level1&#34; number=&#34;11&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;11&lt;/span&gt; Session information&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## R version 4.0.1 (2020-06-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] forcats_0.5.0   stringr_1.4.0   dplyr_1.0.2     purrr_0.3.4    
##  [5] readr_1.3.1     tidyr_1.1.2     tibble_3.0.3    ggplot2_3.3.2  
##  [9] tidyverse_1.3.0 sparklyr_1.4.0 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5         lubridate_1.7.9    forge_0.2.0        utf8_1.1.4        
##  [5] assertthat_0.2.1   rprojroot_1.3-2    digest_0.6.25      slam_0.1-47       
##  [9] R6_2.4.1           cellranger_1.1.0   backports_1.1.10   reprex_0.3.0      
## [13] evaluate_0.14      httr_1.4.2         highr_0.8          blogdown_0.20     
## [17] pillar_1.4.6       rlang_0.4.7        readxl_1.3.1       uuid_0.1-4        
## [21] rstudioapi_0.11    blob_1.2.1         rmarkdown_2.4      config_0.3        
## [25] r2d3_0.2.3         htmlwidgets_1.5.2  munsell_0.5.0      broom_0.7.1       
## [29] compiler_4.0.1     modelr_0.1.8       xfun_0.18          pkgconfig_2.0.3   
## [33] askpass_1.1        base64enc_0.1-3    htmltools_0.5.0    openssl_1.4.3     
## [37] tidyselect_1.1.0   bookdown_0.20      fansi_0.4.1        crayon_1.3.4      
## [41] dbplyr_1.4.4       withr_2.3.0        grid_4.0.1         jsonlite_1.7.1    
## [45] gtable_0.3.0       lifecycle_0.2.0    DBI_1.1.0          magrittr_1.5      
## [49] scales_1.1.1       cli_2.0.2          stringi_1.5.3      fs_1.5.0          
## [53] NLP_0.2-0          xml2_1.3.2         ellipsis_0.3.1     generics_0.0.2    
## [57] vctrs_0.3.4        wordcloud_2.6      RColorBrewer_1.1-2 tools_4.0.1       
## [61] glue_1.4.2         hms_0.5.3          parallel_4.0.1     yaml_2.2.1        
## [65] tm_0.7-7           colorspace_1.4-1   rvest_0.3.6        knitr_1.30        
## [69] haven_2.3.1&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Ordinal data models</title>
      <link>https://modelingwithr.rbind.io/post/ordinal/ordinal-data-models/</link>
      <pubDate>Tue, 09 Jun 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/post/ordinal/ordinal-data-models/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#ordered-logistic-regression-model-logit&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Ordered logistic regression model (logit)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#ordinal-logistic-rgeression-model-probit&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Ordinal logistic regression model (probit)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#cart-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; CART model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#ordinal-random-forst-model.&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Ordinal random forest model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#continuation-ratio-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Continuation Ratio Model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#compare-models&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8&lt;/span&gt; Compare models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;9&lt;/span&gt; Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-information&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;10&lt;/span&gt; Session information&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;This tutorial aims to explore the most popular models used to predict an ordered response variable. We will use the &lt;strong&gt;heart disease&lt;/strong&gt; data &lt;a href=&#34;https://www.kaggle.com/johnsmith88/heart-disease-dataset&#34;&gt;uploaded from the kaggle website&lt;/a&gt;, where our response will be the chest pain variable &lt;strong&gt;cp&lt;/strong&gt; instead of the usually used &lt;strong&gt;target&lt;/strong&gt; variable.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;First, we load the data and the libraries that we will need throughout this illustration.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;options(warn = -1)
library(tidyverse)
library(caret)
library(tidymodels)
mydata&amp;lt;-read.csv(&amp;quot;../heart.csv&amp;quot;,header = TRUE)
names(mydata)[1]&amp;lt;-&amp;quot;age&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The data at hand has the following features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;age.&lt;/li&gt;
&lt;li&gt;sex: 1=male,0=female&lt;/li&gt;
&lt;li&gt;cp : chest pain type.&lt;/li&gt;
&lt;li&gt;trestbps : resting blood pressure.&lt;/li&gt;
&lt;li&gt;chol: serum cholestoral.&lt;/li&gt;
&lt;li&gt;fbs : fasting blood sugar.&lt;/li&gt;
&lt;li&gt;restecg : resting electrocardiographic results.&lt;/li&gt;
&lt;li&gt;thalach : maximum heart rate achieved&lt;/li&gt;
&lt;li&gt;exang : exercise induced angina.&lt;/li&gt;
&lt;li&gt;oldpeak : ST depression induced by exercise relative to rest.&lt;/li&gt;
&lt;li&gt;slope : the slope of the peak exercise ST segment.&lt;/li&gt;
&lt;li&gt;ca : number of major vessels colored by flourosopy.&lt;/li&gt;
&lt;li&gt;thal : it is not well defined from the data source.&lt;/li&gt;
&lt;li&gt;target: have heart disease or not.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A good starting point for exploring the summary of all predictors and the missing values is the powerful &lt;strong&gt;skim&lt;/strong&gt; function from the &lt;strong&gt;skimr&lt;/strong&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;skimr::skim(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;caption&gt;&lt;span id=&#34;tab:unnamed-chunk-3&#34;&gt;Table 2.1: &lt;/span&gt;Data summary&lt;/caption&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Name&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;mydata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Number of rows&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;303&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Number of columns&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;_______________________&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Column type frequency:&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;numeric&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;________________________&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Group variables&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Variable type: numeric&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;skim_variable&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;n_missing&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;complete_rate&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mean&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;sd&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p0&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p25&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p50&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p75&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p100&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;hist&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;age&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;54.37&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.08&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;55.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;61.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;77.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▁▆▇▇▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;sex&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.68&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.47&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▃▁▁▁▇&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;cp&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.97&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▇▃▁▅▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;trestbps&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;131.62&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.54&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;94&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;120.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;130.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;140.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;200.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▃▇▅▁▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;chol&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;246.26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;51.83&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;126&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;211.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;240.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;274.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;564.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▃▇▂▁▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;fbs&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.15&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.36&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▇▁▁▁▂&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;restecg&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.53&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.53&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▇▁▇▁▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;thalach&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;149.65&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;22.91&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;71&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;133.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;153.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;166.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;202.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▁▂▅▇▂&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;exang&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.33&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.47&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▇▁▁▁▃&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;oldpeak&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.16&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▇▂▁▁▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;slope&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.40&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.62&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▁▁▇▁▇&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;ca&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.73&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▇▃▂▁▁&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;thal&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.31&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.61&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▁▁▁▇▆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;target&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.54&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.50&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;▇▁▁▁▇&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For our case we will use the chest pain type &lt;strong&gt;cp&lt;/strong&gt; variable as our target variable since it is a categorical variable. However, for pedagogic purposes, we will transform it into an ordered factor with only three levels, &lt;strong&gt;no pain&lt;/strong&gt;, &lt;strong&gt;moderate pain&lt;/strong&gt;, and &lt;strong&gt;severe pain&lt;/strong&gt; (instead of the current four).&lt;/p&gt;
&lt;p&gt;Looking at the above output, we convert the variables that should be of factor type, namely &lt;strong&gt;sex&lt;/strong&gt;, &lt;strong&gt;target&lt;/strong&gt;, &lt;strong&gt;fbs&lt;/strong&gt;, &lt;strong&gt;restecg&lt;/strong&gt;, &lt;strong&gt;exang&lt;/strong&gt;, &lt;strong&gt;slope&lt;/strong&gt;, &lt;strong&gt;ca&lt;/strong&gt;, and &lt;strong&gt;thal&lt;/strong&gt;. For the response variable &lt;strong&gt;cp&lt;/strong&gt;, we drop its least frequent level with all its related rows, then we rename the remaining levels: &lt;strong&gt;no&lt;/strong&gt; pain for the most frequent one, &lt;strong&gt;severe&lt;/strong&gt; pain for the least frequent one, and &lt;strong&gt;moderate&lt;/strong&gt; pain for the last one.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(mydata$cp)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
  0   1   2   3 
143  50  87  23 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see that level &lt;strong&gt;3&lt;/strong&gt; is the least frequent one.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-mydata %&amp;gt;%
  modify_at(c(&amp;quot;cp&amp;quot;, &amp;quot;sex&amp;quot;, &amp;quot;target&amp;quot;, &amp;quot;fbs&amp;quot;, &amp;quot;restecg&amp;quot;, &amp;quot;exang&amp;quot;, &amp;quot;slope&amp;quot;, &amp;quot;ca&amp;quot;, &amp;quot;thal&amp;quot;),
            as.factor)
mydata&amp;lt;-mydata[mydata$cp!=3,]
mydata$cp&amp;lt;-fct_drop(mydata$cp,only=levels(mydata$cp))
table(mydata$cp)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
  0   1   2 
143  50  87 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Based on these frequencies, we rename and order the levels as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata$cp&amp;lt;-fct_recode(mydata$cp,no=&amp;quot;0&amp;quot;,sev=&amp;quot;1&amp;quot;,mod=&amp;quot;2&amp;quot;)
mydata$cp&amp;lt;-factor(mydata$cp,ordered = TRUE)
mydata$cp&amp;lt;-fct_infreq(mydata$cp)
mydata$cp[1:5]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] mod sev sev no  no 
Levels: no &amp;lt; mod &amp;lt; sev&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As with logistic regression, the number of cases in each cell of the cross table between the outcome and each factor should exceed the threshold of 5 commonly applied in practice.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~cp+sex,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;     sex
cp      0   1
  no   39 104
  mod  35  52
  sev  18  32&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~cp+target,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;     target
cp      0   1
  no  104  39
  mod  18  69
  sev   9  41&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~cp+fbs,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;     fbs
cp      0   1
  no  125  18
  mod  70  17
  sev  45   5&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~cp+restecg,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;     restecg
cp     0  1  2
  no  78 62  3
  mod 36 50  1
  sev 19 31  0&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~cp+exang,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;     exang
cp     0  1
  no  63 80
  mod 76 11
  sev 46  4&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~cp+slope,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;     slope
cp     0  1  2
  no  11 84 48
  mod  5 33 49
  sev  2 12 36&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~cp+ca,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;     ca
cp     0  1  2  3  4
  no  65 34 29 14  1
  mod 57 20  2  5  3
  sev 37  8  3  1  1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~cp+thal,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;     thal
cp     0  1  2  3
  no   1 12 52 78
  mod  1  2 62 22
  sev  0  2 39  9&lt;/code&gt;&lt;/pre&gt;
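&lt;p&gt;As a side note, the eight cross tables above can be generated in a single pass rather than call by call; a minimal sketch (this loop is our own convenience, not part of any package):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;factors &amp;lt;- c(&amp;quot;sex&amp;quot;, &amp;quot;target&amp;quot;, &amp;quot;fbs&amp;quot;, &amp;quot;restecg&amp;quot;,
             &amp;quot;exang&amp;quot;, &amp;quot;slope&amp;quot;, &amp;quot;ca&amp;quot;, &amp;quot;thal&amp;quot;)
# build the formula ~cp + v for each factor and cross-tabulate against cp
lapply(setNames(factors, factors),
       function(v) xtabs(reformulate(c(&amp;quot;cp&amp;quot;, v)), data = mydata))&lt;/code&gt;&lt;/pre&gt;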
&lt;p&gt;The following variables do not meet this threshold and will therefore be removed from the predictor set: &lt;strong&gt;restecg&lt;/strong&gt;, &lt;strong&gt;exang&lt;/strong&gt;, &lt;strong&gt;slope&lt;/strong&gt;, &lt;strong&gt;ca&lt;/strong&gt;, and &lt;strong&gt;thal&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-mydata[,setdiff(names(mydata), 
                        c(&amp;quot;restecg&amp;quot;, &amp;quot;exang&amp;quot;, &amp;quot;slope&amp;quot;, &amp;quot;ca&amp;quot;,  &amp;quot;thal&amp;quot;))]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The data is now ready, so we can split it into training and testing sets.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1122)
parts &amp;lt;- initial_split(mydata, prop=0.8, strata = cp)
train &amp;lt;- training(parts)
test &amp;lt;- testing(parts)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The models that we will use are: the ordinal logistic model, the CART model, the ordinal random forest model, and the continuation ratio model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ordered-logistic-regression-model-logit&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Ordered logistic regression model (logit)&lt;/h1&gt;
&lt;p&gt;Before training this type of model let’s show how it works. For simplicity suppose we have data that has an ordered outcome &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; with three class labels (“1”,“2”,“3”) and only two features &lt;span class=&#34;math inline&#34;&gt;\(x_1\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(x_2\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;First we define a latent variable as a linear combination of the features:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\begin{equation}
y_i^*=\beta_1 X_{i1}+\beta_2 X_{i2}
\end{equation}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Then since we have three classes we define two thresholds for this latent variable &lt;span class=&#34;math inline&#34;&gt;\(\alpha_1\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\alpha_2\)&lt;/span&gt; such that a particular observation &lt;span class=&#34;math inline&#34;&gt;\(y_i\)&lt;/span&gt; will be classified as follows:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\begin{cases} y_i=1 &amp;amp; \text{if $y_i^* \leq \alpha_1$} \\
                y_i=2 &amp;amp; \text{if $\alpha_1 &amp;lt; y_i^* \leq \alpha_2$} \\
                y_i=3 &amp;amp; \text{if $y_i^* &amp;gt; \alpha_2$}
\end{cases}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Now we can obtain the probability that a particular observation falls into a specific class as follows:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\begin{cases} p(y_i=1)=p(y_i^* \leq \alpha_1)=F(\alpha_1-\beta_1 X_{i1}-\beta_2 X_{i2}) \\
                p(y_i=2)=p(\alpha_1 &amp;lt; y_i^* \leq \alpha_2)=F(\alpha_2-\beta_1 X_{i1}-\beta_2 X_{i2})-F(\alpha_1-\beta_1 X_{i1}-\beta_2 X_{i2}) \\
                p(y_i=3)=1-p(y_i=2)-p(y_i=1)\end{cases}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;It remains to choose a suitable distribution function F. Two are commonly used for this type of data: the &lt;strong&gt;logit&lt;/strong&gt; function &lt;span class=&#34;math inline&#34;&gt;\(F(x)=\frac{1}{1+e^{-x}}\)&lt;/span&gt; and the normal distribution function, known as &lt;strong&gt;probit&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: there exist other functions like &lt;strong&gt;loglog&lt;/strong&gt;, &lt;strong&gt;cloglog&lt;/strong&gt;, and &lt;strong&gt;cauchit&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Using the &lt;strong&gt;logit&lt;/strong&gt; function, the probabilities become:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\begin{cases} p(y_i=1)=\frac{1}{1+e^{-(\alpha_1-\beta_1 X_{i1}-\beta_2 X_{i2})}} \\
                p(y_i=2)=\frac{1}{1+e^{-(\alpha_2-\beta_1 X_{i1}-\beta_2 X_{i2})}}-p(y_i=1) \\
                p(y_i=3)=1-p(y_i=2)-p(y_i=1)\end{cases}\]&lt;/span&gt;&lt;/p&gt;
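&lt;p&gt;As a quick sanity check, these formulas can be evaluated directly in base R, where &lt;code&gt;plogis&lt;/code&gt; plays the role of the logistic function F (the threshold and coefficient values below are made up purely for illustration):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# hypothetical values, for illustration only
alpha &amp;lt;- c(4.6, 6.5)           # thresholds alpha_1, alpha_2
eta   &amp;lt;- 1.9 * 1 + 0.02 * 178  # beta_1*X_i1 + beta_2*X_i2
p1 &amp;lt;- plogis(alpha[1] - eta)
p2 &amp;lt;- plogis(alpha[2] - eta) - p1
p3 &amp;lt;- 1 - p1 - p2
c(p1, p2, p3)  # the three class probabilities sum to 1&lt;/code&gt;&lt;/pre&gt;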
&lt;p&gt;The &lt;strong&gt;MASS&lt;/strong&gt; package provides the &lt;strong&gt;polr&lt;/strong&gt; function to fit an ordinal logistic regression, which we call here through caret’s &lt;strong&gt;train&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(MASS)
set.seed(1234)
model_logistic&amp;lt;-train(cp~., data=train,
                      method=&amp;quot;polr&amp;quot;,
                      tuneGrid=expand.grid(method=&amp;quot;logistic&amp;quot;))

summary(model_logistic)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
Coefficients:
              Value Std. Error  t value
age       0.0112236   0.018219  0.61605
sex1      0.2593720   0.316333  0.81993
trestbps -0.0002329   0.009090 -0.02562
chol     -0.0013238   0.002697 -0.49082
fbs1      0.3188826   0.401836  0.79356
thalach   0.0226246   0.008199  2.75933
oldpeak  -0.3360326   0.163547 -2.05465
target1   1.7234740   0.376279  4.58031

Intercepts:
        Value   Std. Error t value
no|mod   4.5786  1.9271     2.3759
mod|sev  6.5004  1.9527     3.3289

Residual Deviance: 376.4697 
AIC: 396.4697 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This table does not provide p-values, but that is not a big problem since we can compute and append them with the following script.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prob &amp;lt;- pnorm(abs(summary(model_logistic)$coefficients[,3]),lower.tail = FALSE)*2
cbind(summary(model_logistic)$coefficients,prob)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;                 Value  Std. Error     t value         prob
age       0.0112236479 0.018218848  0.61604597 5.378642e-01
sex1      0.2593719567 0.316332564  0.81993442 4.122535e-01
trestbps -0.0002329023 0.009090066 -0.02562163 9.795591e-01
chol     -0.0013237835 0.002697079 -0.49082122 6.235529e-01
fbs1      0.3188825831 0.401836034  0.79356393 4.274493e-01
thalach   0.0226246089 0.008199317  2.75932853 5.792027e-03
oldpeak  -0.3360326371 0.163547467 -2.05464899 3.991292e-02
target1   1.7234739863 0.376278770  4.58031152 4.642839e-06
no|mod    4.5785821473 1.927119568  2.37586822 1.750771e-02
mod|sev   6.5003986218 1.952726089  3.32888399 8.719471e-04&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using a p-value threshold of 0.05, we remove the non-significant variables &lt;strong&gt;age&lt;/strong&gt;, &lt;strong&gt;trestbps&lt;/strong&gt;, and &lt;strong&gt;chol&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
model_logistic&amp;lt;-train(cp~.-age-trestbps-chol, data=train,
                      method=&amp;quot;polr&amp;quot;,tuneGrid=expand.grid(method=&amp;quot;logistic&amp;quot;))
prob &amp;lt;- pnorm(abs(summary(model_logistic)$coefficients[,3]),lower.tail = FALSE)*2
cbind(summary(model_logistic)$coefficients,prob)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;              Value  Std. Error    t value         prob
sex1     0.25427581 0.308143065  0.8251875 4.092651e-01
fbs1     0.37177505 0.384667871  0.9664832 3.338024e-01
thalach  0.02050951 0.007487511  2.7391620 6.159602e-03
oldpeak -0.33669473 0.161699555 -2.0822242 3.732199e-02
target1  1.71338020 0.369558584  4.6362885 3.547208e-06
no|mod   4.00836398 1.143111953  3.5065367 4.539789e-04
mod|sev  5.92987585 1.185074388  5.0038005 5.621092e-07&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that we keep the factors &lt;strong&gt;sex&lt;/strong&gt; and &lt;strong&gt;fbs&lt;/strong&gt; even though they are not significant, because the intercepts are significant.&lt;/p&gt;
&lt;p&gt;To better understand these coefficients, let’s restrict the model to only two predictors.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
model1&amp;lt;-train(cp~target+thalach, 
              data=train,
              method = &amp;quot;polr&amp;quot;,
              tuneGrid=expand.grid(method=&amp;quot;logistic&amp;quot;))
summary(model1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
Coefficients:
          Value Std. Error t value
target1 1.87953   0.333153   5.642
thalach 0.02347   0.007372   3.184

Intercepts:
        Value  Std. Error t value
no|mod  4.6457 1.0799     4.3018 
mod|sev 6.5325 1.1271     5.7959 

Residual Deviance: 383.3144 
AIC: 391.3144 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Plugging these coefficients into the above equations, we obtain the probability of each class as follows:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\begin{cases} p(no)=\frac{1}{1+e^{-(4.6457-1.87953X_{i1}-0.02347X_{i2})}} \\
                p(mod)=\frac{1}{1+e^{-(6.5325-1.87953X_{i1}-0.02347X_{i2})}}-p(no) \\
                p(sev)=1-p(mod)-p(no)\end{cases}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Let’s now predict a particular patient, say the third one.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train[3,c(&amp;quot;cp&amp;quot;,&amp;quot;thalach&amp;quot;,&amp;quot;target&amp;quot;)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;   cp thalach target
4 sev     178      1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We plug in the predictor values as follows:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\begin{cases} p(no)=\frac{1}{1+e^{-(4.6457-1.87953*1-0.02347*178)}} \\
                p(mod)=\frac{1}{1+e^{-(6.5325-1.87953*1-0.02347*178)}}-p(no) \\
                p(sev)=1-p(mod)-p(no)\end{cases}=\begin{cases} p(no)=0.1959992 \\
                p(mod)=0.6166398-0.1959992=0.4206406 \\
                p(sev)=1-0.4206406-0.1959992=0.3833602\end{cases}\]&lt;/span&gt;&lt;/p&gt;
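&lt;p&gt;These hand computations can be reproduced in one line each with base R, where &lt;code&gt;plogis&lt;/code&gt; is the logistic distribution function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p_no  &amp;lt;- plogis(4.6457 - 1.87953 * 1 - 0.02347 * 178)
p_mod &amp;lt;- plogis(6.5325 - 1.87953 * 1 - 0.02347 * 178) - p_no
p_sev &amp;lt;- 1 - p_no - p_mod
c(no = p_no, mod = p_mod, sev = p_sev)  # matches the probabilities computed above&lt;/code&gt;&lt;/pre&gt;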
&lt;p&gt;Taking the highest probability, this patient is predicted to have &lt;strong&gt;mod&lt;/strong&gt; pain.
Now let’s compare these probabilities with those returned by the &lt;strong&gt;predict&lt;/strong&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model1, train[1:3,], type = &amp;quot;prob&amp;quot;) %&amp;gt;% tail(1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;         no       mod      sev
4 0.1958709 0.4205981 0.383531&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we go back to our original model and compute the accuracy rate for the training data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model_logistic, train) %&amp;gt;% 
  bind_cols(train) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.611&lt;/code&gt;&lt;/pre&gt;
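&lt;p&gt;The same accuracy can be checked without &lt;strong&gt;yardstick&lt;/strong&gt;, using a plain cross-tabulation of predictions against the truth:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(model_logistic, train)
conf &amp;lt;- table(pred = pred, truth = train$cp)  # confusion matrix
sum(diag(conf)) / sum(conf)                    # overall accuracy&lt;/code&gt;&lt;/pre&gt;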
&lt;p&gt;With the logistic regression model we get 61% accuracy on the training set, which is quite bad. So let’s evaluate the model on the testing set now.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model_logistic, test) %&amp;gt;% 
  bind_cols(test) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.648&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Surprisingly, the accuracy rate for the testing set is about 65%, which is larger than that computed from the training data (61%). This is an indication of an underfitting problem (the opposite of overfitting). Is there any way to improve the model performance? Maybe, by going back and tuning some hyperparameters. This model, however, offers few hyperparameters to tune apart from the link function, which is by default the &lt;strong&gt;logistic&lt;/strong&gt; function; other choices exist, such as &lt;strong&gt;probit&lt;/strong&gt;, &lt;strong&gt;loglog&lt;/strong&gt;, etc.&lt;/p&gt;
&lt;p&gt;For our case, let’s try this model with the probit function.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ordinal-logistic-rgeression-model-probit&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Ordinal logistic regression model (probit)&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
model_probit&amp;lt;-train(cp~.-age-trestbps-chol, data=train,
                    method=&amp;quot;polr&amp;quot;,
                    tuneGrid=expand.grid(method=&amp;quot;probit&amp;quot;))

predict(model_probit, train) %&amp;gt;% 
  bind_cols(train) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.606&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This rate is slightly worse than that of the previous model. But what about the testing set?&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model_probit, test) %&amp;gt;% 
  bind_cols(test) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.593&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This one is also worse than in the previous model, which means that for this data the logistic link performs better than the probit one.&lt;/p&gt;
&lt;p&gt;When many attempts to improve a model’s performance do not gain much, it is better to try different types of models.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;cart-model&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; CART model&lt;/h1&gt;
&lt;p&gt;This is a tree-based model used for both classification and regression. To train it we make use of the &lt;strong&gt;rpartScore&lt;/strong&gt; package, and for simplicity we include only the significant predictors from the previous model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rpartScore)
set.seed(1234)
model_cart&amp;lt;-train(cp~.-age-trestbps-chol, data=train,
                      method=&amp;quot;rpartScore&amp;quot;)
model_cart&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;CART or Ordinal Responses 

226 samples
  8 predictor
  3 classes: &amp;#39;no&amp;#39;, &amp;#39;mod&amp;#39;, &amp;#39;sev&amp;#39; 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 226, 226, 226, 226, 226, 226, ... 
Resampling results across tuning parameters:

  cp          split  prune  Accuracy   Kappa    
  0.02702703  abs    mr     0.5748197  0.2845545
  0.02702703  abs    mc     0.5796085  0.3011122
  0.02702703  quad   mr     0.5711605  0.2764466
  0.02702703  quad   mc     0.5805216  0.3020125
  0.04504505  abs    mr     0.5620975  0.2719646
  0.04504505  abs    mc     0.5966801  0.3274893
  0.04504505  quad   mr     0.5592845  0.2608402
  0.04504505  quad   mc     0.5930817  0.3208220
  0.21621622  abs    mr     0.5303342  0.1266324
  0.21621622  abs    mc     0.6004116  0.3343997
  0.21621622  quad   mr     0.5290009  0.1143360
  0.21621622  quad   mc     0.5928132  0.3225686

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were cp = 0.2162162, split = abs and
 prune = mc.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The caret model uses the bootstrapping technique for hyperparameter tuning. In our case, the largest accuracy rate is about 60.04%, with the complexity parameter &lt;code&gt;cp = 0.2162162&lt;/code&gt;, &lt;code&gt;split = abs&lt;/code&gt;, and &lt;code&gt;prune = mc&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The argument &lt;strong&gt;split&lt;/strong&gt; controls the splitting function used to grow the tree by setting the misclassification costs in the generalized &lt;strong&gt;Gini&lt;/strong&gt; impurity function to the absolute &lt;strong&gt;abs&lt;/strong&gt; or squared &lt;strong&gt;quad&lt;/strong&gt;.
The argument &lt;strong&gt;prune&lt;/strong&gt; is used to select the performance measure to prune the tree between total misclassification rate &lt;strong&gt;mr&lt;/strong&gt; or misclassification cost &lt;strong&gt;mc&lt;/strong&gt;.&lt;/p&gt;
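&lt;p&gt;If we wanted to fit the model with the selected values directly instead of tuning, caret accepts them through an explicit &lt;code&gt;tuneGrid&lt;/code&gt; (a sketch using the values chosen above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
model_cart_fixed &amp;lt;- train(cp ~ . - age - trestbps - chol, data = train,
                          method = &amp;quot;rpartScore&amp;quot;,
                          tuneGrid = expand.grid(cp = 0.2162162,
                                                 split = &amp;quot;abs&amp;quot;,
                                                 prune = &amp;quot;mc&amp;quot;))&lt;/code&gt;&lt;/pre&gt;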
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model_cart, train) %&amp;gt;% 
  bind_cols(train) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.615&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Surprisingly, we get approximately the same accuracy rate as the logit model. Let’s check the testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model_cart, test) %&amp;gt;% 
  bind_cols(test) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.630&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this model, however, we get a lower accuracy rate on the testing set than that of the logistic model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ordinal-random-forst-model.&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Ordinal Random forest model&lt;/h1&gt;
&lt;p&gt;This model is a corrected version of the random forest model that takes into account the ordinal nature of the response variable. For more detail about this model, read this great &lt;a href=&#34;https://pdfs.semanticscholar.org/5bb3/5b76774bf0d582eda4ec06e2cb3ce021772c.pdf&#34;&gt;paper&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To train an ordinal random forest model, we need to call the following packages:
&lt;strong&gt;e1071&lt;/strong&gt;, &lt;strong&gt;ranger&lt;/strong&gt;, and &lt;strong&gt;ordinalForest&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ordinalForest)
library(ranger)
library(e1071)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the caret function &lt;strong&gt;train&lt;/strong&gt; uses bootstrapping to tune the hyperparameters, the training process is very slow; that is why we save the resulting model and load it again.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# set.seed(1234)
# model_forest&amp;lt;-train(cp~.-age-trestbps-chol, data=train,
#                       method=&amp;#39;ordinalRF&amp;#39;)

# saveRDS(model_forest, #&amp;quot;C://Users/dell/Documents/new-blog/content/post/ordinal/model_forest.rds&amp;quot;)

model_forest &amp;lt;- readRDS(&amp;quot;C://Users/dell/Documents/new-blog/content/post/ordinal/model_forest.rds&amp;quot;)

model_forest&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Random Forest 

226 samples
  8 predictor
  3 classes: &amp;#39;no&amp;#39;, &amp;#39;mod&amp;#39;, &amp;#39;sev&amp;#39; 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 226, 226, 226, 226, 226, 226, ... 
Resampling results across tuning parameters:

  nsets  ntreeperdiv  ntreefinal  Accuracy   Kappa    
   50     50          200         0.5808002  0.3008422
   50     50          400         0.5776249  0.2954635
   50     50          600         0.5802381  0.3009845
   50    100          200         0.5805333  0.2982787
   50    100          400         0.5835550  0.3046105
   50    100          600         0.5792347  0.2966789
   50    150          200         0.5781306  0.2957198
   50    150          400         0.5763106  0.2929363
   50    150          600         0.5773418  0.2939428
  100     50          200         0.5825633  0.3037443
  100     50          400         0.5766958  0.2946094
  100     50          600         0.5801625  0.2992074
  100    100          200         0.5817261  0.3017512
  100    100          400         0.5802315  0.2984311
  100    100          600         0.5760195  0.2936909
  100    150          200         0.5791770  0.2986367
  100    150          400         0.5773527  0.2940674
  100    150          600         0.5800019  0.2990121
  150     50          200         0.5738722  0.2890697
  150     50          400         0.5755389  0.2915668
  150     50          600         0.5793087  0.2994984
  150    100          200         0.5821339  0.3039247
  150    100          400         0.5810183  0.3003594
  150    100          600         0.5797573  0.3001752
  150    150          200         0.5792505  0.2992324
  150    150          400         0.5757645  0.2930867
  150    150          600         0.5802099  0.2993488

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nsets = 50, ntreeperdiv = 100
 and ntreefinal = 400.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can plot the important predictors as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(varImp(model_forest))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/ordinal/2020-06-09-ordinal-data-models.en_files/figure-html/unnamed-chunk-25-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now we can obtain the accuracy rate for the training set as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model_forest, train) %&amp;gt;% 
  bind_cols(train) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.819&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Great! With this model, the accuracy rate has improved substantially, to roughly 82%. But wait: what matters is the accuracy on the testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model_forest, test) %&amp;gt;% 
  bind_cols(test) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.574&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is exactly what is called the overfitting problem: the model generalizes poorly to new, unseen data. We could go back and tune other hyperparameters, such as increasing the minimum node size (default is 5), to fight overfitting. We do not do that here, however, since it is not the purpose of this tutorial.&lt;/p&gt;
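&lt;p&gt;As a sketch of such a tuning step, &lt;strong&gt;ordinalForest&lt;/strong&gt; can be called directly with a larger minimum node size; the argument names here follow the &lt;code&gt;ordfor&lt;/code&gt; help page and should be checked against your installed version:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# not run: a hedged sketch, not part of the results above
set.seed(1234)
forest_reg &amp;lt;- ordfor(depvar = &amp;quot;cp&amp;quot;, data = train,
                     nsets = 50, ntreeperdiv = 100, ntreefinal = 400,
                     min.node.size = 10)  # larger terminal nodes to reduce overfitting&lt;/code&gt;&lt;/pre&gt;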
&lt;/div&gt;
&lt;div id=&#34;continuation-ratio-model&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Continuation Ratio Model&lt;/h1&gt;
&lt;p&gt;This model uses vector generalized additive models, which are available in the &lt;strong&gt;VGAM&lt;/strong&gt; package. For more detail about these models, click &lt;a href=&#34;https://cran.r-project.org/web/packages/VGAM/vignettes/categoricalVGAM.pdf&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(VGAM)
set.seed(1234)
model_vgam&amp;lt;-train(cp~.-age-trestbps-chol, data=train,
                  method=&amp;quot;vglmContRatio&amp;quot;, trace=FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_vgam&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Continuation Ratio Model for Ordinal Data 

226 samples
  8 predictor
  3 classes: &amp;#39;no&amp;#39;, &amp;#39;mod&amp;#39;, &amp;#39;sev&amp;#39; 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 226, 226, 226, 226, 226, 226, ... 
Resampling results across tuning parameters:

  parallel  link     Accuracy   Kappa    
  FALSE     logit    0.5962581  0.3323075
  FALSE     probit   0.5942637  0.3302998
  FALSE     cloglog  0.5973844  0.3293056
  FALSE     cauchit  0.5967368  0.3316896
  FALSE     logc     0.5945121  0.3152759
   TRUE     logit    0.5758330  0.2961673
   TRUE     probit   0.5738297  0.2924747
   TRUE     cloglog  0.5838764  0.3014038
   TRUE     cauchit  0.5810184  0.3067004
   TRUE     logc     0.5302522  0.1031624

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were parallel = FALSE and link = cloglog.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The best model is obtained when the argument &lt;strong&gt;parallel&lt;/strong&gt; is FALSE and &lt;strong&gt;link&lt;/strong&gt; is &lt;strong&gt;cloglog&lt;/strong&gt;, the complementary log-log function.&lt;/p&gt;
&lt;p&gt;The accuracy rate of the training data is:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model_vgam, train) %&amp;gt;% 
  bind_cols(train) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.659&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And the accuracy of the testing set is:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(model_vgam, test) %&amp;gt;% 
  bind_cols(test) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy multiclass     0.630&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is one of the better test accuracy rates so far, though still below that of the logistic model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;compare-models&#34; class=&#34;section level1&#34; number=&#34;8&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;8&lt;/span&gt; Compare models&lt;/h1&gt;
&lt;p&gt;We can compare the above models using caret’s &lt;strong&gt;resamples&lt;/strong&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;models_eval&amp;lt;-resamples(list(logit=model_logistic,
                            cart=model_cart,
                            forest=model_forest,
                            vgam=model_vgam))
summary(models_eval)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
Call:
summary.resamples(object = models_eval)

Models: logit, cart, forest, vgam 
Number of resamples: 25 

Accuracy 
            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA&amp;#39;s
logit  0.5060241 0.5731707 0.5822785 0.5871083 0.6097561 0.6627907    0
cart   0.3734940 0.5824176 0.6097561 0.6004116 0.6279070 0.6746988    0
forest 0.4891304 0.5609756 0.5853659 0.5835550 0.6162791 0.6385542    0
vgam   0.4936709 0.5760870 0.6046512 0.5973844 0.6202532 0.6626506    0

Kappa 
               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA&amp;#39;s
logit   0.189086980 0.2792369 0.3144822 0.3100458 0.3437500 0.4512651    0
cart   -0.004889406 0.3185420 0.3474144 0.3343997 0.3775576 0.4526136    0
forest  0.186912373 0.2719432 0.3091678 0.3046105 0.3464604 0.4011544    0
vgam    0.144558744 0.2993406 0.3367647 0.3293056 0.3690791 0.4142980    0&lt;/code&gt;&lt;/pre&gt;
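&lt;p&gt;caret also provides lattice plots for a &lt;strong&gt;resamples&lt;/strong&gt; object, which make this comparison easier to read:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bwplot(models_eval, metric = &amp;quot;Accuracy&amp;quot;)   # box plots of resampled accuracy
dotplot(models_eval, metric = &amp;quot;Accuracy&amp;quot;)  # means with confidence intervals&lt;/code&gt;&lt;/pre&gt;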
&lt;p&gt;Based on the training set and using the mean accuracy rate, the &lt;strong&gt;cart&lt;/strong&gt; model is the best model for this data, with a mean of 60.04% across resamples. However, things are different when we use the testing set instead.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tibble(models=c(&amp;quot;logit&amp;quot;, &amp;quot;cart&amp;quot;, &amp;quot;forest&amp;quot;, &amp;quot;vgam&amp;quot;), 
       accuracy=c(
  predict(model_logistic, test) %&amp;gt;% 
  bind_cols(test) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth) %&amp;gt;% 
  .[, &amp;quot;.estimate&amp;quot;],
  predict(model_cart, test) %&amp;gt;% 
  bind_cols(test) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth) %&amp;gt;% 
  .[, &amp;quot;.estimate&amp;quot;],
  predict(model_forest, test) %&amp;gt;% 
  bind_cols(test) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth) %&amp;gt;% 
  .[, &amp;quot;.estimate&amp;quot;],
  predict(model_vgam, test) %&amp;gt;% 
  bind_cols(test) %&amp;gt;%
  rename(pred=&amp;quot;...1&amp;quot;, truth=cp) %&amp;gt;% 
  accuracy(pred, truth) %&amp;gt;% 
  .[, &amp;quot;.estimate&amp;quot;])) %&amp;gt;% 
  unnest()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 4 x 2
  models accuracy
  &amp;lt;chr&amp;gt;     &amp;lt;dbl&amp;gt;
1 logit     0.648
2 cart      0.630
3 forest    0.574
4 vgam      0.630&lt;/code&gt;&lt;/pre&gt;
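&lt;p&gt;The repeated pipeline above can be condensed by mapping one small helper over a named list of models (same result, less duplication):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test_accuracy &amp;lt;- function(model) {
  predict(model, test) %&amp;gt;%
    bind_cols(test) %&amp;gt;%
    rename(pred = &amp;quot;...1&amp;quot;, truth = cp) %&amp;gt;%
    accuracy(pred, truth) %&amp;gt;%
    pull(.estimate)
}
list(logit = model_logistic, cart = model_cart,
     forest = model_forest, vgam = model_vgam) %&amp;gt;%
  map_dbl(test_accuracy) %&amp;gt;%
  enframe(name = &amp;quot;models&amp;quot;, value = &amp;quot;accuracy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;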
&lt;p&gt;Using the testing set, the logistic model with the link &lt;strong&gt;logit&lt;/strong&gt; is the best model to predict this data.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;9&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;9&lt;/span&gt; Conclusion&lt;/h1&gt;
&lt;p&gt;We have seen so far how to model ordinal data by exploring several models, and it turned out that the logistic model is the best one for our data. In general, however, the best model depends strongly on the data at hand.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;session-information&#34; class=&#34;section level1&#34; number=&#34;10&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;10&lt;/span&gt; Session information&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] splines   stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] VGAM_1.1-3          e1071_1.7-3         ranger_0.12.1      
 [4] ordinalForest_2.4-1 rpartScore_1.0-1    rpart_4.1-15       
 [7] MASS_7.3-53         yardstick_0.0.7     workflows_0.2.0    
[10] tune_0.1.1          rsample_0.0.8       recipes_0.1.13     
[13] parsnip_0.1.3       modeldata_0.0.2     infer_0.5.3        
[16] dials_0.0.9         scales_1.1.1        broom_0.7.1        
[19] tidymodels_0.1.1    caret_6.0-86        lattice_0.20-41    
[22] forcats_0.5.0       stringr_1.4.0       dplyr_1.0.2        
[25] purrr_0.3.4         readr_1.3.1         tidyr_1.1.2        
[28] tibble_3.0.3        ggplot2_3.3.2       tidyverse_1.3.0    

loaded via a namespace (and not attached):
 [1] colorspace_1.4-1     ellipsis_0.3.1       class_7.3-17        
 [4] base64enc_0.1-3      fs_1.5.0             rstudioapi_0.11     
 [7] listenv_0.8.0        furrr_0.1.0          prodlim_2019.11.13  
[10] fansi_0.4.1          lubridate_1.7.9      xml2_1.3.2          
[13] codetools_0.2-16     knitr_1.30           jsonlite_1.7.1      
[16] pROC_1.16.2          dbplyr_1.4.4         compiler_4.0.1      
[19] httr_1.4.2           backports_1.1.10     assertthat_0.2.1    
[22] Matrix_1.2-18        cli_2.0.2            htmltools_0.5.0     
[25] tools_4.0.1          gtable_0.3.0         glue_1.4.2          
[28] reshape2_1.4.4       Rcpp_1.0.5           cellranger_1.1.0    
[31] DiceDesign_1.8-1     vctrs_0.3.4          nlme_3.1-149        
[34] blogdown_0.20        iterators_1.0.12     timeDate_3043.102   
[37] gower_0.2.2          xfun_0.18            globals_0.13.0      
[40] rvest_0.3.6          lifecycle_0.2.0      future_1.19.1       
[43] ipred_0.9-9          hms_0.5.3            parallel_4.0.1      
[46] yaml_2.2.1           stringi_1.5.3        highr_0.8           
[49] foreach_1.5.0        lhs_1.1.0            lava_1.6.8          
[52] repr_1.1.0           rlang_0.4.7          pkgconfig_2.0.3     
[55] evaluate_0.14        tidyselect_1.1.0     plyr_1.8.6          
[58] magrittr_1.5         bookdown_0.20        R6_2.4.1            
[61] generics_0.0.2       DBI_1.1.0            pillar_1.4.6        
[64] haven_2.3.1          withr_2.3.0          survival_3.2-7      
[67] nnet_7.3-14          modelr_0.1.8         crayon_1.3.4        
[70] utf8_1.1.4           rmarkdown_2.4        grid_4.0.1          
[73] readxl_1.3.1         data.table_1.13.0    blob_1.2.1          
[76] ModelMetrics_1.2.2.2 reprex_0.3.0         digest_0.6.25       
[79] munsell_0.5.0        GPfit_1.0-8          skimr_2.1.2         &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Predicting binary response variable with h2o framework</title>
      <link>https://modelingwithr.rbind.io/sparklyr/h2o/predicting-binary-response-variable-with-h2o-framework/</link>
      <pubDate>Wed, 03 Jun 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/sparklyr/h2o/predicting-binary-response-variable-with-h2o-framework/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#logistic-regression&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Logistic regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#random-forest&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Random forest&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#random-forest-with-binomial-double-trees&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.1&lt;/span&gt; Random forest with binomial double trees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#random-forest-tuning&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.2&lt;/span&gt; Random forest tuning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#deep-learning-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Deep learning model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Conclusion:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#further-reading&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Further reading&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-information&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8&lt;/span&gt; Session information&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;H2O is an open-source distributed scalable framework used to train machine learning and deep learning models as well as data analysis. It can handle large data sets, with ease of use, by creating a cluster from the available nodes. Fortunately, it provides an API for R users to get the most benefits from it, especially when it comes to large data sets, with which R has its most limitations.&lt;/p&gt;
&lt;p&gt;The beauty is that R users can load and use this system via the &lt;strong&gt;h2o&lt;/strong&gt; package, which can be called and used like any other R package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# install.packages(&amp;quot;h2o&amp;quot;) if not already installed
library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;-- Attaching packages -------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;v ggplot2 3.3.2     v purrr   0.3.4
v tibble  3.0.3     v dplyr   1.0.2
v tidyr   1.1.2     v stringr 1.4.0
v readr   1.3.1     v forcats 0.5.0&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;ggplot2&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;tibble&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;tidyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;dplyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;-- Conflicts ----------------------
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(h2o)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
----------------------------------------------------------------------

Your next step is to start H2O:
    &amp;gt; h2o.init()

For H2O package documentation, ask for help:
    &amp;gt; ??h2o

After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit https://docs.h2o.ai

----------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
Attaching package: &amp;#39;h2o&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;The following objects are masked from &amp;#39;package:stats&amp;#39;:

    cor, sd, var&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;The following objects are masked from &amp;#39;package:base&amp;#39;:

    %*%, %in%, &amp;amp;&amp;amp;, ||, apply, as.factor, as.numeric, colnames,
    colnames&amp;lt;-, ifelse, is.character, is.factor, is.numeric, log,
    log10, log1p, log2, round, signif, trunc&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, to launch the cluster, run the following script:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.init(nthreads = -1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    C:\Users\dell\AppData\Local\Temp\RtmpGuHBDV\file2e5438ee3e8c/h2o_dell_started_from_r.out
    C:\Users\dell\AppData\Local\Temp\RtmpGuHBDV\file2e54103214ed/h2o_dell_started_from_r.err


Starting H2O JVM and connecting: . Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         6 seconds 974 milliseconds 
    H2O cluster timezone:       Europe/Paris 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.30.1.3 
    H2O cluster version age:    13 days  
    H2O cluster name:           H2O_started_from_R_dell_sgv874 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.99 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
    R Version:                  R version 4.0.1 (2020-06-06) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looking at this output, we see that h2o runs on the Java virtual machine (JVM), so you need Java already installed. Notice that I set the &lt;strong&gt;nthreads&lt;/strong&gt; argument to -1 to tell h2o to create its cluster using all the available cores.&lt;/p&gt;
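&lt;p&gt;As a side note, you can check at any time whether R is still connected to a running cluster, and shut the cluster down once you are done. The sketch below uses the &lt;strong&gt;h2o.clusterIsUp&lt;/strong&gt; and &lt;strong&gt;h2o.shutdown&lt;/strong&gt; helpers from the package; the shutdown call is commented out since you should only run it at the end of your session.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# TRUE if R is still connected to a running cluster
h2o.clusterIsUp()

# stop the cluster when the analysis is finished
# h2o.shutdown(prompt = FALSE)&lt;/code&gt;&lt;/pre&gt;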
&lt;p&gt;Since our purpose is to understand how to work with h2o, we will use a small data set in which the response is a binary variable. The data we will use is &lt;strong&gt;creditcard&lt;/strong&gt;, downloaded from the &lt;a href=&#34;https://www.kaggle.com/mlg-ulb/creditcardfraud&#34;&gt;kaggle&lt;/a&gt; website.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;To import the data directly into the h2o cluster, we use the function &lt;strong&gt;h2o.importFile&lt;/strong&gt; as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;card &amp;lt;- h2o.importFile(&amp;quot;../creditcard.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |======================================================================| 100%&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following script gives the dimensions of this data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.dim(card)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 284807     31&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This data has 284807 rows and 31 columns. According to the description of this data, the response variable is &lt;strong&gt;class&lt;/strong&gt;, with two values: 1 for a &lt;strong&gt;fraudulent card&lt;/strong&gt; and 0 for a &lt;strong&gt;regular card&lt;/strong&gt;. The other variables are PCA components derived from the original ones for privacy purposes, for instance to protect the users’ identities.&lt;br /&gt;
So first let’s check the summary of this data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knitr::kable(h2o.describe(card))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Label&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Type&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Missing&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Zeros&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PosInf&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;NegInf&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Min&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Max&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Mean&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Sigma&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Cardinality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Time&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;int&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.727920e+05&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.481386e+04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.748815e+04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-56.407510&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.454930e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.958696e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-72.715728&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.205773e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.651309e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-48.325589&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.382558e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.516255e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-5.683171&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.687534e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.415869e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V5&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-113.743307&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.480167e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.380247e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V6&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-26.160506&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.330163e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.332271e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V7&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-43.557242&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.205895e+02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.237094e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V8&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-73.216718&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.000721e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.194353e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V9&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-13.434066&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.559499e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.098632e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V10&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-24.588262&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.374514e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.088850e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V11&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-4.797473&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.201891e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.020713e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V12&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-18.683715&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.848392e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.992014e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-5.791881&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.126883e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.952742e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V14&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-19.214326&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.052677e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.585956e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V15&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-4.498945&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.877742e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.153160e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V16&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-14.129855&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.731511e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.762529e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V17&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-25.162799&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.253526e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.493371e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V18&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-9.498746&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.041069e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.381762e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V19&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-7.213527&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.591971e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.140405e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V20&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-54.497720&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.942090e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.709250e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V21&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-34.830382&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.720284e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.345240e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V22&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-10.933144&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.050309e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.257016e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V23&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-44.807735&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.252841e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.244603e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V24&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-2.836627&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.584549e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.056471e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V25&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-10.295397&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.519589e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.212781e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V26&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-2.604551&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.517346e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.822270e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V27&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-22.565679&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.161220e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.036325e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V28&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-15.430084&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.384781e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.300833e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Amount&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1825&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.569116e+04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.834962e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.501201e+02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Class&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;int&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;284315&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.727500e-03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.152720e-02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The most important issues to check first are missing values and, for classification problems, class imbalance.&lt;/p&gt;
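&lt;p&gt;For the imbalance problem, a quick way to get the class distribution is to tabulate the response with &lt;strong&gt;h2o.table&lt;/strong&gt;, as in the sketch below; from the &lt;strong&gt;Class&lt;/strong&gt; row of the summary above, with 284315 zeros out of 284807 rows, we already know the 0 class dominates heavily.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# frequency of each level of the response variable
h2o.table(card[&amp;quot;Class&amp;quot;])&lt;/code&gt;&lt;/pre&gt;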
&lt;p&gt;For the missing values, you should know that R only recognizes a value as missing if it is written as &lt;strong&gt;NA&lt;/strong&gt; or left as a blank cell. If missing values in imported data are written in any other format, for instance as strings like &lt;strong&gt;na&lt;/strong&gt; or &lt;strong&gt;missing&lt;/strong&gt;, we must tell R to convert them to &lt;strong&gt;NA&lt;/strong&gt;. The same applies when, as in our case, a variable takes a zero value it should never take. The &lt;strong&gt;Amount&lt;/strong&gt; variable, for instance, should never equal zero, since every transaction involves some amount of money, yet in the data it has 1825 zeros; the same applies to the &lt;strong&gt;Time&lt;/strong&gt; variable with its two zeros. However, since the data is large, this is not a big issue, and we can comfortably remove these rows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;card$Amount &amp;lt;- h2o.ifelse(card$Amount == 0, NA, card$Amount)
card$Time &amp;lt;- h2o.ifelse(card$Time == 0, NA, card$Time)
card &amp;lt;- h2o.na_omit(card)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It is good practice to check your output after each transformation to make sure your code did what you expected.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knitr::kable(h2o.describe(card))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Label&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Type&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Missing&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Zeros&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PosInf&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;NegInf&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Min&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Max&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Mean&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Sigma&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Cardinality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Time&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;int&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.000000&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.727920e+05&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;94849.6338858&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.748196e+04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-56.407510&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.454930e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0003483&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.956753e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-72.715728&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.205773e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0020179&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.650496e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-48.325589&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.382558e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0033027&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.514214e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-5.683171&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.687534e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0119933&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.404852e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V5&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-113.743307&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.480167e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0022396&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.378819e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V6&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-26.160506&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.330163e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0013051&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.331596e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V7&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-43.557242&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.205895e+02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0025090&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.233944e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V8&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-73.216718&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.000721e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0000269&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.191177e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V9&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-13.320155&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.559499e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0014642&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.099065e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V10&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-24.588262&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.374514e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0022783&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.087587e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V11&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-4.797473&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.201891e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0023114&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.018693e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V12&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-18.683715&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.848392e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0008656&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.972279e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-5.791881&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.126883e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0006992&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.945502e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V14&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-19.214326&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.052677e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0002020&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.555395e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V15&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-4.498945&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.877742e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0036456&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.137113e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V16&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-14.129855&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.731511e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0010958&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.760560e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V17&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-25.162799&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.253526e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0016190&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.462568e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V18&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-9.498746&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.041069e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0013067&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.386969e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V19&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-7.213527&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.591971e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0019147&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.119902e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V20&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-54.497720&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.942090e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0009807&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.705625e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V21&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-34.830382&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.720284e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0000481&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.326525e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V22&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-10.933144&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.050309e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0016073&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.255767e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V23&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-44.807735&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.252841e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0001474&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.230342e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V24&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-2.836627&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.584549e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0002018&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.057968e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V25&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-10.295397&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.519589e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0005087&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.209869e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V26&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-2.604551&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.517346e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-0.0013648&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.819297e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V27&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-22.565679&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.161220e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0002533&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.029874e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;V28&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-15.430084&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.384781e+01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0001927&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.303524e-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Amount&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;real&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.010000&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.569116e+04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;88.9194915&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.508252e+02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Class&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;int&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;282515&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.000000&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.000000e+00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0016432&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.050350e-02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;However, we have a very serious class imbalance problem: the &lt;strong&gt;Class&lt;/strong&gt; variable, which takes only the two values 0 and 1, has a mean of about 0.0016, which means that the overwhelming majority of observations carry the label 0.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.table(card$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;  Class  Count
1     0 282515
2     1    465

[2 rows x 2 columns] &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As expected, the majority of cases are of class label 0. Any machine learning model fitted to this data without correcting this problem will be dominated by label 0 and will rarely predict the fraudulent cards (label 1) correctly, which are our main interest.&lt;/p&gt;
&lt;p&gt;The h2o package provides a way to correct the imbalance problem. For &lt;strong&gt;glm&lt;/strong&gt; models, for instance, we have three arguments for this purpose:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;balance_classes: if set to TRUE, the classes are balanced by resampling, either with default ratios or with the ratios specified in the next argument.&lt;/li&gt;
&lt;li&gt;class_sampling_factors: The desired sampling ratios per class (over or under-sampling).&lt;/li&gt;
&lt;li&gt;max_after_balance_size: The desired relative size of the training data after balancing class counts.&lt;/li&gt;
&lt;/ul&gt;
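&lt;p&gt;To illustrate how these three arguments fit together, here is a hypothetical sketch of a balanced &lt;strong&gt;glm&lt;/strong&gt; call (the sampling factors are arbitrary illustration values, not recommendations):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# hypothetical sketch: oversample the minority class (label 1) ten times
model_balanced &amp;lt;- h2o.glm(
  x = 1:30,
  y = 31,
  training_frame = card,
  family = &amp;quot;binomial&amp;quot;,
  balance_classes = TRUE,
  class_sampling_factors = c(1, 10),
  max_after_balance_size = 2
)&lt;/code&gt;&lt;/pre&gt;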
&lt;p&gt;Before going ahead, we should randomly split the data into a training set (80% of the data) and a testing set (the remaining 20%).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;card$Class &amp;lt;- h2o.asfactor(card$Class)

parts &amp;lt;- h2o.splitFrame(card, 0.8, seed = 1111)
train &amp;lt;- parts[[1]]
test &amp;lt;- parts[[2]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.table(train$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;  Class  Count
1     0 226268
2     1    372

[2 rows x 2 columns] &lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.table(test$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;  Class Count
1     0 56247
2     1    93

[2 rows x 2 columns] &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;logistic-regression&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Logistic regression&lt;/h1&gt;
&lt;p&gt;For binary classification problems, the first model that comes to mind is the logistic regression model. It belongs to the family of &lt;strong&gt;glm&lt;/strong&gt; models: setting the family argument to &lt;strong&gt;binomial&lt;/strong&gt; gives a logistic regression model. The following are the main arguments of &lt;strong&gt;glm&lt;/strong&gt; models (besides the arguments discussed above):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;x: the predictor names (not the data itself) or their column indices.&lt;/li&gt;
&lt;li&gt;y: the name of the response variable (again, not the whole column).&lt;/li&gt;
&lt;li&gt;training_frame: the training data frame.&lt;/li&gt;
&lt;li&gt;model_id: a name for the model.&lt;/li&gt;
&lt;li&gt;nfolds: the number of folds to use for cross-validation when tuning hyperparameters.&lt;/li&gt;
&lt;li&gt;seed: for reproducibility.&lt;/li&gt;
&lt;li&gt;fold_assignment: the scheme of the cross-validation: AUTO, Random, Stratified, or Modulo.&lt;/li&gt;
&lt;li&gt;family: many distributions are provided; for binary responses we have &lt;strong&gt;binomial&lt;/strong&gt; and &lt;strong&gt;quasibinomial&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;solver: the optimization algorithm. With &lt;strong&gt;AUTO&lt;/strong&gt;, h2o decides the best one given the data, but you can choose another one such as IRLSM, L_BFGS, or COORDINATE_DESCENT.&lt;/li&gt;
&lt;li&gt;alpha: the ratio that mixes the L1 (lasso) and L2 (ridge regression) regularization; larger values yield more lasso.&lt;/li&gt;
&lt;li&gt;lambda_search: lambda is the overall regularization strength. If TRUE, the model tries different lambda values.&lt;/li&gt;
&lt;li&gt;standardize: whether to standardize the numeric columns.&lt;/li&gt;
&lt;li&gt;compute_p_values: computes p-values for the coefficients; it does not work with regularization.&lt;/li&gt;
&lt;li&gt;link: the link function.&lt;/li&gt;
&lt;li&gt;interactions: the predictors to interact, if we want interactions between predictors.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now we are ready to train our model with some specified values. But first, let’s try to use the original data without correcting the imbalance problem.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_logit &amp;lt;- h2o.glm(
  x = 1:30,
  y = 31,
  training_frame = train,
  model_id = &amp;quot;glm_binomial_no_eg&amp;quot;,
  seed = 123,
  lambda = 0,
  family = &amp;quot;binomial&amp;quot;,
  solver = &amp;quot;IRLSM&amp;quot;,
  standardize = TRUE,
  link = &amp;quot;family_default&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |======================================================================| 100%&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;h2o provides a bunch of metrics already computed during the training process, along with the confusion matrix. We can access them by calling the function &lt;strong&gt;h2o.performance&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.performance(model_logit)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;H2OBinomialMetrics: glm
** Reported on training data. **

MSE:  0.0006269349
RMSE:  0.02503867
LogLoss:  0.003809522
Mean Per-Class Error:  0.1103587
AUC:  0.9731273
AUCPR:  0.7485898
Gini:  0.9462545
R^2:  0.6174137
Residual Deviance:  1726.78
AIC:  1788.78

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
            0   1    Error         Rate
0      226203  65 0.000287   =65/226268
1          82 290 0.220430      =82/372
Totals 226285 355 0.000649  =147/226640

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold         value idx
1                       max f1  0.135889      0.797799 176
2                       max f2  0.058169      0.808081 200
3                 max f0point5  0.475561      0.833333 102
4                 max accuracy  0.135889      0.999351 176
5                max precision  0.999976      0.909091   0
6                   max recall  0.000027      1.000000 397
7              max specificity  0.999976      0.999960   0
8             max absolute_mcc  0.135889      0.797693 176
9   max min_per_class_accuracy  0.001118      0.919355 345
10 max mean_per_class_accuracy  0.002782      0.934336 314
11                     max tns  0.999976 226259.000000   0
12                     max fns  0.999976    282.000000   0
13                     max fps  0.000007 226268.000000 399
14                     max tps  0.000027    372.000000 397
15                     max tnr  0.999976      0.999960   0
16                     max fnr  0.999976      0.758065   0
17                     max fpr  0.000007      1.000000 399
18                     max tpr  0.000027      1.000000 397

Gains/Lift Table: Extract with `h2o.gainsLift(&amp;lt;model&amp;gt;, &amp;lt;data&amp;gt;)` or `h2o.gainsLift(&amp;lt;model&amp;gt;, valid=&amp;lt;T/F&amp;gt;, xval=&amp;lt;T/F&amp;gt;)`&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To extract only the confusion matrix, we call the function &lt;strong&gt;h2o.confusionMatrix&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_logit)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.135888872638703:
            0   1    Error         Rate
0      226203  65 0.000287   =65/226268
1          82 290 0.220430      =82/372
Totals 226285 355 0.000649  =147/226640&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looking at the confusion matrix, we get a very low error rate for the major label (0.029%), whereas the error rate for the minor label is quite high (22.04%). This result is expected since the data are highly dominated by the label “0”.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_logit, test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.0767397449673996:
           0  1    Error       Rate
0      56223 24 0.000427  =24/56247
1         19 74 0.204301     =19/93
Totals 56242 98 0.000763  =43/56340&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On the testing set, the error rate of the major class (0.043%) is slightly larger than its training counterpart, whereas the error rate of the minor class (20.43%) is smaller than its training counterpart (22.04%).&lt;/p&gt;
&lt;p&gt;We can correct the imbalance problem by setting the argument &lt;strong&gt;balance_classes&lt;/strong&gt; to TRUE. Unfortunately, I trained this model many times, but this argument did not seem to work for some reason. I do not know whether this problem occurs with this version of h2o for everyone or only for me due to some issue with my laptop. I posted a question about it on &lt;strong&gt;stackoverflow&lt;/strong&gt; but had not yet received an answer at the time of writing.&lt;/p&gt;
&lt;p&gt;Alternatively, we can correct the imbalance problem by loading the data into R as a data frame, balancing it with the &lt;strong&gt;ROSE&lt;/strong&gt; package, and then converting the corrected data back to an h2o object.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Loading data from h2o into R will not always be possible for a very large dataset. I am using this alternative only to carry on with our analysis and not get stuck.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_R &amp;lt;- as.data.frame(train)
train_balance &amp;lt;- ROSE::ROSE(Class~., data=train_R, seed=111)$data
table(train_balance$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
     0      1 
113244 113396 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we feed this corrected data to our model again after converting it back to h2o.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_h &amp;lt;- as.h2o(train_balance)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_logit2 &amp;lt;- h2o.glm(
  x = 1:30,
  y = 31,
  training_frame = train_h,
  model_id = &amp;quot;glm_binomial_balance&amp;quot;,
  seed = 123,
  lambda = 0,
  family = &amp;quot;binomial&amp;quot;,
  solver = &amp;quot;IRLSM&amp;quot;,
  standardize = TRUE,
  link = &amp;quot;family_default&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |======================================================================| 100%&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can check the confusion matrix as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_logit2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.448958365921134:
            0      1    Error           Rate
0      110591   2653 0.023427   =2653/113244
1       12594 100802 0.111062  =12594/113396
Totals 123185 103455 0.067274  =15247/226640&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the reliable measure of model performance is on unseen data, let’s use our testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_logit2, test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.9289188100923:
           0   1    Error       Rate
0      56200  47 0.000836  =47/56247
1         16  77 0.172043     =16/93
Totals 56216 124 0.001118  =63/56340&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since we are more interested in the minor class, we will count it as an improvement if we get a lower error rate for that class. After correcting the class imbalance problem, the minor class error rate has dropped from 20.43% to 17.20%.&lt;/p&gt;
&lt;p&gt;One strategy to improve our model is to remove the less important variables by hand using a threshold. h2o provides a function that lists the predictors in decreasing order of their importance in predicting the response variable, so we can remove the least important variables in the hope of reducing the error rate of the minor class.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.varimp(model_logit)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;   variable relative_importance scaled_importance   percentage
1        V4         0.994893006       1.000000000 0.1440671995
2       V10         0.916557351       0.921262231 0.1327236697
3       V14         0.532427886       0.535160950 0.0770991394
4       V22         0.481085303       0.483554815 0.0696643880
5        V9         0.371758112       0.373666424 0.0538330752
6       V20         0.360902368       0.362754956 0.0522610906
7       V27         0.340929107       0.342679168 0.0493688281
8       V13         0.324390153       0.326055315 0.0469738762
9       V21         0.309050873       0.310637296 0.0447526453
10      V16         0.230524987       0.231708320 0.0333815688
11   Amount         0.213887758       0.214985688 0.0309723861
12       V8         0.211491780       0.212577411 0.0306254323
13     Time         0.209412404       0.210487362 0.0303243248
14       V6         0.189761276       0.190735361 0.0274787093
15       V5         0.176081346       0.176985209 0.0254977634
16       V1         0.164618852       0.165463875 0.0238379171
17      V12         0.134542774       0.135233410 0.0194826987
18      V11         0.129031560       0.129693906 0.0186846379
19      V28         0.093633317       0.094113956 0.0135587341
20      V26         0.093283287       0.093762130 0.0135080475
21      V17         0.081893193       0.082313568 0.0118586852
22       V7         0.077962451       0.078362649 0.0112894873
23      V23         0.067840817       0.068189058 0.0098238066
24      V18         0.065741510       0.066078975 0.0095198129
25      V25         0.033325258       0.033496323 0.0048257215
26       V2         0.029047974       0.029197083 0.0042063420
27      V24         0.025833162       0.025965769 0.0037408156
28      V19         0.022354254       0.022469003 0.0032370463
29      V15         0.020189854       0.020293493 0.0029236267
30       V3         0.003304571       0.003321534 0.0004785241&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or as a plot:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.varimp_plot(model_logit)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/sparklyr/h2o/2020-06-03-predicting-binary-response-variable-with-h2o-framework.en_files/figure-html/unnamed-chunk-21-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
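&lt;p&gt;As a sketch of this by-hand strategy, we could keep only the predictors whose scaled importance exceeds some threshold (the 5% cutoff below is an arbitrary choice for illustration) and refit the model on that subset:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;imp &amp;lt;- as.data.frame(h2o.varimp(model_logit))
keep &amp;lt;- imp$variable[imp$scaled_importance &amp;gt; 0.05]
model_reduced &amp;lt;- h2o.glm(
  x = keep,
  y = &amp;quot;Class&amp;quot;,
  training_frame = train_h,
  family = &amp;quot;binomial&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;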
&lt;p&gt;A better strategy for removing the less important variables is to use lasso regression (L1), which strips out the less important ones automatically; it is also known as a feature selection method. Lasso, like ridge regression (L2), is a regularization technique used to fight overfitting, and beyond that it acts as a reduction technique since it reduces the number of predictors. We enable this method in h2o by setting &lt;code&gt;alpha=1&lt;/code&gt;, where &lt;strong&gt;alpha&lt;/strong&gt; controls the trade-off between lasso (L1) and ridge regression (L2); alpha closer to zero means more ridge than lasso.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_lasso &amp;lt;- h2o.glm(
  x = 1:30,
  y = 31,
  training_frame = train_h,
  model_id = &amp;quot;glm_binomial_lasso&amp;quot;,
  seed = 123,
  alpha = 1,
  family = &amp;quot;binomial&amp;quot;,
  solver = &amp;quot;IRLSM&amp;quot;,
  standardize = TRUE,
  link = &amp;quot;family_default&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |======================================================================| 100%&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_lasso)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.429938956856689:
            0      1    Error           Rate
0      110315   2929 0.025865   =2929/113244
1       12339 101057 0.108813  =12339/113396
Totals 122654 103986 0.067367  =15268/226640&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using the testing set, the confusion matrix will be:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_lasso, test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.958116537926135:
           0   1    Error       Rate
0      56210  37 0.000658  =37/56247
1         20  73 0.215054     =20/93
Totals 56230 110 0.001012  =57/56340&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the lasso model, the error rate of the minor class on the testing set has increased from 17.20% to 21.50%, which contradicts the improvement recorded on the training data, where the rate decreased from 11.11% to 10.88% with the lasso model.&lt;/p&gt;
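&lt;p&gt;To see which predictors the lasso actually stripped out, we can inspect the fitted coefficients and list the ones shrunk exactly to zero (a quick sketch using &lt;code&gt;h2o.coef&lt;/code&gt;):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;coefs &amp;lt;- h2o.coef(model_lasso)
# predictors removed by the L1 penalty
names(coefs)[coefs == 0]&lt;/code&gt;&lt;/pre&gt;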
&lt;p&gt;One last thing about hyperparameter tuning: some hyperparameters are not supported by the &lt;strong&gt;h2o.grid&lt;/strong&gt; function, for instance the &lt;strong&gt;solver&lt;/strong&gt; argument. But this is not an issue, since we can simply loop over the hyperparameters in question ourselves. Let’s explore the most popular solvers using the R lapply function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;solvers &amp;lt;- c(
  &amp;quot;IRLSM&amp;quot;,
  &amp;quot;L_BFGS&amp;quot;,
  &amp;quot;COORDINATE_DESCENT&amp;quot;
)

mygrid &amp;lt;- lapply(solvers, function(solver) {
  h2o.glm(
    x = 1:30,
    y = 31,
    training_frame = train_h,
    seed = 123,
    model_id = paste0(&amp;quot;logit_&amp;quot;, solver),
    family = &amp;quot;binomial&amp;quot;,
    solver = solver,
    standardize = TRUE,
    link = &amp;quot;family_default&amp;quot;
  )
   
})&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df &amp;lt;- cbind(
  h2o.confusionMatrix(mygrid[[1]])$Error,
  h2o.confusionMatrix(mygrid[[2]])$Error,
  h2o.confusionMatrix(mygrid[[3]])$Error
)
df &amp;lt;- t(round(df, digits = 6))
dimnames(df) &amp;lt;- list(
  list(&amp;quot;IRLSM&amp;quot;, &amp;quot;L_BFGS&amp;quot;,  &amp;quot;COORDINATE_DESCENT&amp;quot;),
  list(&amp;quot;Error (0)&amp;quot;, &amp;quot;Error (1)&amp;quot;, &amp;quot;Total Error&amp;quot;)
  
)
df&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;                   Error (0) Error (1) Total Error
IRLSM               0.024169  0.110313    0.067270
L_BFGS              0.024354  0.110189    0.067301
COORDINATE_DESCENT  0.025909  0.109051    0.067508&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It seems there is no significant difference between these solvers. If we focus on the error of the minor class, however, &lt;strong&gt;COORDINATE_DESCENT&lt;/strong&gt; is the best one, with the lowest error. But this could be the result of random chance, since we did not use cross-validation.&lt;/p&gt;
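&lt;p&gt;To reduce the role of chance, we could rerun the same comparison with cross-validation. A minimal sketch, assuming the same h2o session and the train_h frame used above (the fold settings are illustrative, not part of the original run):&lt;/p&gt;

```r
# Rerun the solver comparison with 5-fold cross-validation, so the
# metrics are averaged over folds instead of coming from a single fit.
solvers <- c("IRLSM", "L_BFGS", "COORDINATE_DESCENT")

cv_models <- lapply(solvers, function(solver) {
  h2o.glm(
    x = 1:30,
    y = 31,
    training_frame = train_h,
    seed = 123,
    family = "binomial",
    solver = solver,
    standardize = TRUE,
    nfolds = 5,                     # 5-fold cross-validation
    fold_assignment = "Stratified"  # keep the class ratio in each fold
  )
})

# Compare the cross-validated logloss of the three solvers
sapply(cv_models, h2o.logloss, xval = TRUE)
```

The solver with the lowest cross-validated logloss would then be a more trustworthy pick than one chosen from a single training fit.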
&lt;/div&gt;
&lt;div id=&#34;random-forest&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Random forest&lt;/h1&gt;
&lt;p&gt;The random forest model is one of the most popular machine learning models due to its capability to capture even complex patterns in the data. At the same time, however, this capability can be a downside, since the model tends to memorize everything in the data, including the noise, which gives rise to the overfitting problem. That is why this model has a large number of hyperparameters, for regularization techniques among others, to control the training process.
The main hyperparameters provided by h2o are the following&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;seed: for reproducibility.&lt;/li&gt;
&lt;li&gt;ntrees: The number of trees used (also called iterations). The default is 50.&lt;/li&gt;
&lt;li&gt;max_depth: The maximum depth allowed for each tree. The default is 20.&lt;/li&gt;
&lt;li&gt;mtries: The number of columns chosen randomly for each tree. The default is &lt;span class=&#34;math inline&#34;&gt;\(\sqrt{p}\)&lt;/span&gt; for classification, and &lt;span class=&#34;math inline&#34;&gt;\(\frac{p}{3}\)&lt;/span&gt; for regression (where p is the number of columns).&lt;/li&gt;
&lt;li&gt;sample_rate: The proportion of the training data selected randomly for each tree. The default is 63.2%.&lt;/li&gt;
&lt;li&gt;balance_classes: This is one of the most important hyperparameters for our data, since it is highly imbalanced. The default is false; if set to true, the model will correct this problem by making use of over/under-sampling methods.&lt;/li&gt;
&lt;li&gt;min_rows: The minimum number of instances in a node to allow splitting it. The default is 1.&lt;/li&gt;
&lt;li&gt;min_split_improvement: The minimum error reduction needed to make a further split. The default is 0.&lt;/li&gt;
&lt;li&gt;binomial_double_trees: For binary classification. If true, the model builds two random forests, one for each output class. This method can give higher accuracy at the cost of doubling the computation time.&lt;/li&gt;
&lt;li&gt;stopping_rounds: The number of iterations used for early stopping: the training process stops if the moving average of the stopping_metric (over this number of iterations) does not improve. The default is 0, which means early stopping is disabled.&lt;/li&gt;
&lt;li&gt;stopping_metric: Works with the previous argument. The default is AUTO, that is, the &lt;strong&gt;logloss&lt;/strong&gt; for classification and the &lt;strong&gt;deviance&lt;/strong&gt; for regression, but we also have &lt;strong&gt;MSE&lt;/strong&gt;, &lt;strong&gt;RMSE&lt;/strong&gt;, &lt;strong&gt;MAE&lt;/strong&gt;, &lt;strong&gt;AUC&lt;/strong&gt;, and &lt;strong&gt;misclassification&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;stopping_tolerance: The threshold under which we consider no improvement. The default is 0.001.&lt;/li&gt;
&lt;/ul&gt;
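&lt;p&gt;To make these arguments concrete, here is a sketch of a call that sets several of them at once; the values are purely illustrative, not tuned:&lt;/p&gt;

```r
# Illustrative values only: combine several of the hyperparameters above
model_sketch <- h2o.randomForest(
  x = 1:30,
  y = 31,
  training_frame = train,
  seed = 123,                  # reproducibility
  ntrees = 100,                # grow 100 trees instead of the default 50
  max_depth = 15,              # shallower trees than the default 20
  sample_rate = 0.5,           # each tree sees 50% of the rows
  min_rows = 10,               # at least 10 instances to split a node
  balance_classes = TRUE,      # over/under-sample the imbalanced classes
  stopping_rounds = 3,         # early stopping over a 3-iteration window
  stopping_metric = "logloss",
  stopping_tolerance = 0.001
)
```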
&lt;p&gt;First let’s try this model with the default values, except for balance_classes, which we set to true. Fortunately, unlike glm models, this argument works fine with the random forest model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_rf &amp;lt;- h2o.randomForest(
  x = 1:30,
  y = 31,
  training_frame = train,
  seed = 123,
  model_id = &amp;quot;rf_default&amp;quot;,
  balance_classes = TRUE
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we check how this model did with the training data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.performance(model_rf)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

MSE:  0.03503944
RMSE:  0.1871882
LogLoss:  0.1012676
Mean Per-Class Error:  6.629307e-06
AUC:  0.999995
AUCPR:  0.9999938
Gini:  0.9999901
R^2:  0.8598422

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
            0      1    Error       Rate
0      226265      3 0.000013  =3/226268
1           0 226262 0.000000  =0/226262
Totals 226265 226265 0.000007  =3/452530

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold         value idx
1                       max f1  0.060268      0.999993 397
2                       max f2  0.060268      0.999997 397
3                 max f0point5  0.060268      0.999989 397
4                 max accuracy  0.060268      0.999993 397
5                max precision  1.000000      1.000000   0
6                   max recall  0.060268      1.000000 397
7              max specificity  1.000000      1.000000   0
8             max absolute_mcc  0.060268      0.999987 397
9   max min_per_class_accuracy  0.060268      0.999987 397
10 max mean_per_class_accuracy  0.060268      0.999993 397
11                     max tns  1.000000 226268.000000   0
12                     max fns  1.000000 132282.000000   0
13                     max fps  0.000002 226268.000000 399
14                     max tps  0.060268 226262.000000 397
15                     max tnr  1.000000      1.000000   0
16                     max fnr  1.000000      0.584641   0
17                     max fpr  0.000002      1.000000 399
18                     max tpr  0.060268      1.000000 397

Gains/Lift Table: Extract with `h2o.gainsLift(&amp;lt;model&amp;gt;, &amp;lt;data&amp;gt;)` or `h2o.gainsLift(&amp;lt;model&amp;gt;, valid=&amp;lt;T/F&amp;gt;, xval=&amp;lt;T/F&amp;gt;)`&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Surprisingly, the model is almost perfect, with a 0.0007% overall error rate. This is very suspicious, since it suggests the model has memorized everything, even the noisy patterns. The real challenge for any model is how well it generalizes to unseen data, which is why we should always hold out some data as a testing set to assess the model performance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_rf, test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.00208813635434166:
           0  1    Error       Rate
0      56240  7 0.000124   =7/56247
1         16 77 0.172043     =16/93
Totals 56256 84 0.000408  =23/56340&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As expected, the model overfitted the data: the error rate of the minor class on the testing set is now very large, at 17.20%, the same as that obtained from the default logistic regression model.&lt;/p&gt;
&lt;div id=&#34;random-forest-with-binomial-double-trees&#34; class=&#34;section level2&#34; number=&#34;4.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.1&lt;/span&gt; Random forest with binomial double trees&lt;/h2&gt;
&lt;p&gt;Before going ahead with hyperparameter tuning, let’s try the binomial double trees technique discussed above.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_rf_dbl &amp;lt;- h2o.randomForest(
  x = 1:30,
  y = 31,
  training_frame = train,
  seed = 123,
  model_id = &amp;quot;rf_double&amp;quot;,
  binomial_double_trees = TRUE 
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_rf_dbl)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.473692341854698:
            0   1    Error        Rate
0      226258  10 0.000044  =10/226268
1          83 289 0.223118     =83/372
Totals 226341 299 0.000410  =93/226640&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_rf_dbl, test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.42:
           0  1    Error       Rate
0      56239  8 0.000142   =8/56247
1         13 80 0.139785     =13/93
Totals 56252 88 0.000373  =21/56340&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, this model is the best one so far, with the lowest error rate for the minor class at 13.98%.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;random-forest-tuning&#34; class=&#34;section level2&#34; number=&#34;4.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.2&lt;/span&gt; Random forest tuning&lt;/h2&gt;
&lt;p&gt;We can try to tune the hyperparameters related to the regularization techniques to fight the overfitting problem. For instance, we use lower values for max_depth and larger values for min_rows to prune the trees, and lower values for sample_rate to let each tree focus on a small part of the training data. We also set some values to stop the training process early if we do not obtain a significant improvement. Finally, to avoid the randomness of the results, we use cross-validation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#model_rftuned &amp;lt;- h2o.grid(
#  &amp;quot;randomForest&amp;quot;,
#  hyper_params = list(
#    max_depth = c(5, 10),
#    min_rows = c(10, 20, 30),
#    sample_rate = c(0.3, 0.5)
#  ),
# stopping_rounds = 5,
#  stopping_metric = &amp;quot;AUTO&amp;quot;,
#  stopping_tolerance = 0.001,
#  balance_classes = TRUE,
#  nfolds = 5,
#  fold_assignment = &amp;quot;Stratified&amp;quot;,
#  x = 1:30,
#  y = 31,
#  training_frame = train
#)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since this model took a lot of time to train, I saved its output in a CSV file and then loaded it back.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#df_output &amp;lt;- model_rftuned@summary_table %&amp;gt;% 
#  select(max_depth, min_rows, sample_rate, logloss) %&amp;gt;% 
#  arrange(logloss)
#write.csv(df_output, &amp;quot;df_output.csv&amp;quot;,  row.names = F)
df_output &amp;lt;- read.csv(&amp;quot;df_output.csv&amp;quot;)
knitr::kable(df_output)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;max_depth&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;min_rows&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;sample_rate&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;logloss&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;30&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0041177&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0041834&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;30&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0043959&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;30&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0044893&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0045269&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0045655&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0045780&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0046402&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0046463&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0046960&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;30&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0047160&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0047175&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Using the logloss metric, the best model is obtained with 10 for max_depth, 30 for min_rows, and 0.3 for sample_rate. Now let’s rerun the model with these values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_rf_best &amp;lt;- h2o.randomForest(
  x = 1:30,
  y = 31,
  training_frame = train,
  seed = 123,
  model_id = &amp;quot;rf_best&amp;quot;,
  max_depth = 10,
  min_rows = 30,
  sample_rate = 0.3,
  stopping_rounds = 5,
  stopping_metric = &amp;quot;AUTO&amp;quot;,
  stopping_tolerance = 0.001,
  balance_classes = TRUE
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_rf_best)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.00180477647421117:
            0      1    Error          Rate
0      225985    283 0.001251   =283/226268
1        2582 223680 0.011412  =2582/226262
Totals 228567 223963 0.006331  =2865/452530&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The model did well with the training data. But what about the testing set?&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_rf_best, test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.00496959295307637:
           0   1    Error       Rate
0      56226  21 0.000373  =21/56247
1         13  80 0.139785     =13/93
Totals 56239 101 0.000603  =34/56340&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this model, we get the same error rate for the minor class as with the binomial double trees model, but the binomial double trees model keeps the lower overall error rate.&lt;/p&gt;
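&lt;p&gt;To compare the two random forest variants side by side, we can bind their testing-set error columns into one table, as we did for the solvers earlier. A minimal sketch, assuming the model_rf_dbl and model_rf_best objects from above are still available:&lt;/p&gt;

```r
# Collect the testing-set errors of the two random forest variants
comparison <- rbind(
  double_trees = h2o.confusionMatrix(model_rf_dbl, test)$Error,
  tuned        = h2o.confusionMatrix(model_rf_best, test)$Error
)
colnames(comparison) <- c("Error (0)", "Error (1)", "Total Error")
round(comparison, digits = 6)
```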
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;deep-learning-model&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Deep learning model&lt;/h1&gt;
&lt;p&gt;Deep learning models are known for their high accuracy on very large and complex datasets. They have a large number of hyperparameters that can be tuned to efficiently handle a wide range of datasets. Tuning a large number of hyperparameters on large datasets, however, requires very large hardware resources and a lot of time, which is not always available or can be very costly (using cloud providers’ platforms). That is why this type of model requires considerable experience and practice to correctly set the right hyperparameter values.&lt;/p&gt;
&lt;p&gt;There are many frameworks for deep learning models. The most used ones are &lt;strong&gt;tensorflow&lt;/strong&gt; and &lt;strong&gt;keras&lt;/strong&gt;, since they are designed specifically for this type of model and can handle almost all the famous architectures, such as the &lt;strong&gt;feedforward neural network&lt;/strong&gt;, the &lt;strong&gt;convolutional neural network&lt;/strong&gt;, and the &lt;strong&gt;recurrent neural network&lt;/strong&gt;. Besides, they also provide tools to define our own architectures.&lt;/p&gt;
&lt;p&gt;h2o, for its part, provides only the &lt;strong&gt;feedforward neural network&lt;/strong&gt;, which consists of densely connected layers. This architecture, however, is the most widely used one in economics.
We can briefly discuss the main hyperparameters provided by h2o for this type of model (in addition to some of the hyperparameters above):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;hidden&lt;/strong&gt;: Specifies the number of hidden layers and the number of nodes in each layer. The default is 2 layers with 200 nodes each. Notice that the number of nodes in the input and output layers is determined automatically by h2o given the data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;autoencoder&lt;/strong&gt;: If true, we train an 
&lt;a href=&#34;https://towardsdatascience.com/autoencoder-neural-networks-what-and-how-354cba12bf86&#34;&gt;autoencoder&lt;/a&gt; model; otherwise the model uses supervised learning, which is the default.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;activation&lt;/strong&gt;: The activation function used. h2o provides three, each with or without dropout: Tanh, Rectifier, Maxout, TanhWithDropout, RectifierWithDropout, MaxoutWithDropout. The default is Rectifier.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;hidden_dropout_ratio&lt;/strong&gt;: A regularization technique that randomly drops a fraction of the node values from a hidden layer. The default is 0.5.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;missing_values_handling&lt;/strong&gt;: With two values, &lt;strong&gt;MeanImputation&lt;/strong&gt; and &lt;strong&gt;Skip&lt;/strong&gt;. The default is &lt;strong&gt;MeanImputation&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;input_dropout_ratio&lt;/strong&gt;: The same as the previous argument but for the input layer. The default is 0.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;L1 and L2&lt;/strong&gt;: For lasso and ridge regularization. The default is 0 for both.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;max_w2&lt;/strong&gt;: The upper limit on the sum of the squared weights incoming to each node. This can help fight the &lt;strong&gt;exploding gradient problem&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;train_samples_per_iteration&lt;/strong&gt;: The number of samples used before declaring one iteration. At the end of one iteration, the model is scored. The default is -2, which means h2o will decide given the data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;score_interval&lt;/strong&gt;: The alternative of the previous one, where the model will be scored after every 5 seconds with the default settings.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;score_duty_cycle&lt;/strong&gt;: It is another alternative to the two previous ones. It is the fraction of time spent in scoring, at the expense of that spent in training. The default is 0.1, which means 10% of the total time will be spent in scoring while the remaining 90% will be spent on training.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;target_ratio_comm_to_comp&lt;/strong&gt;: It is related to the cluster management. It controls the fraction of the communication time between nodes (The cluster nodes not the layer nodes). The default is 0.05, which means 5% of the total time will be spent on communication, and 95% in training inside each node.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;replicate_training_data&lt;/strong&gt;: The default is true, which means replicate the entire data on every cluster node.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;shuffle_training_data&lt;/strong&gt;: shuffle the inputs before feeding them into the network. It is recommended when we set balance_classes to true (like in our case). The default is false.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;score_validation_samples&lt;/strong&gt;: The number of samples from the validation set used in scoring. if we set this to 0 (which is the default) then the entire validation data will be used.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;score_training_samples&lt;/strong&gt;: The default is 10000, which the number of samples used from the training data to use in scoring. It is used when we do not have validation data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;score_validation_sampling&lt;/strong&gt;: It is used when we use only a fraction of the validation (when the &lt;strong&gt;score_validation_samples&lt;/strong&gt; has been specified with other values than the default of 0). The default is &lt;strong&gt;Uniform&lt;/strong&gt;, but for our case with imbalance classes we can use instead &lt;strong&gt;Stratified&lt;/strong&gt;, which is also provided as another value for this argument.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Since the two classes in our case are imbalanced, we set the &lt;strong&gt;balance_classes&lt;/strong&gt; argument to TRUE and leave all the other arguments at their default settings.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#model_deep &amp;lt;- h2o.deeplearning(
#  x = 1:30,
#  y = 31,
#  training_frame = train,
#  model_id = &amp;quot;deep_def&amp;quot;,
#  balance_classes = TRUE
#)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we did earlier, we save the model and then load it again to avoid rerunning it when rendering this document.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#h2o.saveModel(model_deep, 
#              path = #&amp;quot;C://Users/dell/Documents/new-blog/content/sparklyr/h2o&amp;quot;,
#              force = TRUE)
model_deep &amp;lt;- h2o.loadModel(&amp;quot;C://Users/dell/Documents/new-blog/content/sparklyr/h2o/deep_def&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_deep)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 9.38199481764512e-07:
          0    1    Error     Rate
0      4904    5 0.001019  =5/4909
1         0 5017 0.000000  =0/5017
Totals 4904 5022 0.000504  =5/9926&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Like the models above, this model predicts the training data almost perfectly.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_deep, test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.991909621262387:
           0  1    Error       Rate
0      56237 10 0.000178  =10/56247
1         17 76 0.182796     =17/93
Totals 56254 86 0.000479  =27/56340&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, this model does not predict the minority class very well. This result is expected since we used only the default values, so let’s try some custom hyperparameter values now.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: We will not tune the hyperparameters here because of the limited resources of my laptop.&lt;/p&gt;
&lt;p&gt;As a guideline, since the default deep learning model above fit the training data almost perfectly but generalized poorly to the unseen testing data, we should reduce the model’s complexity and add some regularization. So we will set the following values.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;hidden: we will use two hidden layers with 100 nodes each (instead of the default of 200 each).&lt;/li&gt;
&lt;li&gt;nfolds: we will use 5 folds to properly score the model on validation data (not the training data).&lt;/li&gt;
&lt;li&gt;fold_assignment: set to “Stratified” to make sure the minority class appears in every fold. This is crucially important with imbalanced classes.&lt;/li&gt;
&lt;li&gt;hidden_dropout_ratios: we set this to 0.2 for both layers.&lt;/li&gt;
&lt;li&gt;activation: with the previous argument we must provide the matching dropout activation function, &lt;strong&gt;RectifierWithDropout&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;L1: we set this argument to 0.0001.&lt;/li&gt;
&lt;li&gt;variable_importances: TRUE by default, so we set it to FALSE to reduce computation time, since our goal is prediction, not explanation.&lt;/li&gt;
&lt;li&gt;shuffle_training_data: since replicate_training_data is TRUE by default, we set this to TRUE (the default is FALSE) to shuffle the training data.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#model_deep_new &amp;lt;- h2o.deeplearning(
#  x = 1:30,
#  y = 31,
#  training_frame = train,
#  nfolds = 5,
#  fold_assignment = &amp;quot;Stratified&amp;quot;,
#  hidden = c(100,100),
#  model_id = &amp;quot;deep_new&amp;quot;,
#  standardize = TRUE,
#  balance_classes = TRUE,
#  hidden_dropout_ratios = c(0.2,0.2),
#  activation = &amp;quot;RectifierWithDropout&amp;quot;,
#  l1=1e-4,
#  variable_importances = FALSE,
#  shuffle_training_data = TRUE
#)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To prevent this model from being rerun when rendering our R Markdown document, we save the model and load it again for further use.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#h2o.saveModel(model_deep_new, 
#              path = #&amp;quot;C://Users/dell/Documents/new-blog/content/sparklyr/h2o&amp;quot;,
#              force = TRUE)
model_deep_new &amp;lt;- h2o.loadModel(&amp;quot;C://Users/dell/Documents/new-blog/content/sparklyr/h2o/deep_new&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_deep_new)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 1.69163968745202e-06:
          0    1    Error        Rate
0      4908   77 0.015446    =77/4985
1        51 5022 0.010053    =51/5073
Totals 4959 5099 0.012726  =128/10058&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As expected, this model is less accurate on the training data than the default one because it is less flexible. In other words, it has a larger bias, but we hope it also has a lower variance, which can be verified using the testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.confusionMatrix(model_deep_new, test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.250769269598065:
           0  1    Error       Rate
0      56228 19 0.000338  =19/56247
1         15 78 0.161290     =15/93
Totals 56243 97 0.000603  =34/56340&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With these new settings, we obtained a large improvement in the error rate of the minority class, about 16% (compared to about 18% for the default model). This rate is still larger than that of the best random forest model (13.97%). If you have enough time, you can improve the model further by applying a grid search over some hyperparameters.&lt;/p&gt;
&lt;p&gt;Finally, when you finish your work, do not forget to shut down h2o to free your resources as follows:&lt;/p&gt;
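For instance, such a grid search could be sketched with h2o.grid as below. The block is commented out, like the other expensive calls in this post, because it needs a running h2o cluster and the train frame from above; the grid candidates (hidden sizes, dropout ratios, l1 values) and the grid id are illustrative guesses, not recommendations.

```r
# Not run: requires a live h2o cluster and the train frame defined above.
# The candidate values below are illustrative, not tuned recommendations.
#grid_deep <- h2o.grid(
#  algorithm = "deeplearning",
#  grid_id = "deep_grid",
#  x = 1:30,
#  y = 31,
#  training_frame = train,
#  activation = "RectifierWithDropout",
#  balance_classes = TRUE,
#  hyper_params = list(
#    hidden = list(c(50, 50), c(100, 100)),
#    hidden_dropout_ratios = list(c(0.1, 0.1), c(0.2, 0.2)),
#    l1 = c(0, 1e-4)
#  )
#)
# Rank the trained models by AUC and inspect the best one.
#h2o.getGrid("deep_grid", sort_by = "auc", decreasing = TRUE)
```

With 2 hidden configurations, 2 dropout ratios, and 2 l1 values, this grid trains 8 models, so keep the grid small on limited hardware.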
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.shutdown()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Are you sure you want to shutdown the H2O instance running at http://localhost:54321/ (Y/N)? &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Conclusion:&lt;/h1&gt;
&lt;p&gt;Perhaps the most important lesson from this article is how much the hyperparameter values matter for model performance. The performance gap between models of the same type (with different hyperparameter values) can be larger than the gap between different types of models. In other words, if you do not have much time, spend it fine-tuning the hyperparameters of one model rather than trying different types of models. In practice, for large and complex datasets, the most powerful models are, in order: deep learning, XGBoost, and random forest.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;further-reading&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Further reading&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Darren Cook, Practical Machine Learning with H2O, O’Reilly, 2017.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.h2o.ai&#34; class=&#34;uri&#34;&gt;https://docs.h2o.ai&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;session-information&#34; class=&#34;section level1&#34; number=&#34;8&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;8&lt;/span&gt; Session information&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] h2o_3.30.1.3    forcats_0.5.0   stringr_1.4.0   dplyr_1.0.2    
 [5] purrr_0.3.4     readr_1.3.1     tidyr_1.1.2     tibble_3.0.3   
 [9] ggplot2_3.3.2   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0  xfun_0.18         haven_2.3.1       colorspace_1.4-1 
 [5] vctrs_0.3.4       generics_0.0.2    htmltools_0.5.0   yaml_2.2.1       
 [9] blob_1.2.1        rlang_0.4.7       pillar_1.4.6      glue_1.4.2       
[13] withr_2.3.0       DBI_1.1.0         bit64_4.0.5       dbplyr_1.4.4     
[17] modelr_0.1.8      readxl_1.3.1      lifecycle_0.2.0   munsell_0.5.0    
[21] blogdown_0.20     gtable_0.3.0      cellranger_1.1.0  rvest_0.3.6      
[25] codetools_0.2-16  evaluate_0.14     knitr_1.30        fansi_0.4.1      
[29] highr_0.8         broom_0.7.1       Rcpp_1.0.5        scales_1.1.1     
[33] backports_1.1.10  jsonlite_1.7.1    bit_4.0.4         fs_1.5.0         
[37] hms_0.5.3         digest_0.6.25     stringi_1.5.3     bookdown_0.20    
[41] grid_4.0.1        bitops_1.0-6      cli_2.0.2         tools_4.0.1      
[45] ROSE_0.0-3        magrittr_1.5      RCurl_1.98-1.2    crayon_1.3.4     
[49] pkgconfig_2.0.3   ellipsis_0.3.1    data.table_1.13.0 xml2_1.3.2       
[53] reprex_0.3.0      lubridate_1.7.9   assertthat_0.2.1  rmarkdown_2.4    
[57] httr_1.4.2        rstudioapi_0.11   R6_2.4.1          compiler_4.0.1   &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;Darren Cook, Practical Machine Learning with H2O, O’Reilly, 2017, p. 115&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Bayesian hyperparameters optimization</title>
      <link>https://modelingwithr.rbind.io/bayes/hyper_bayes/bayesian-hyperparameters-method/</link>
      <pubDate>Wed, 13 May 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/bayes/hyper_bayes/bayesian-hyperparameters-method/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bayesian-optimization&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Bayesian optimization&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#acquisition-functions&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.1&lt;/span&gt; Acquisition functions&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#upper-confidence-bound&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.1.1&lt;/span&gt; Upper confidence bound&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#probability-of-improvement-pi&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.1.2&lt;/span&gt; Probability of improvement PI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#expected-improvement-ei&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2.1.3&lt;/span&gt; Expected improvement EI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#random-forest-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Random forest model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#the-true-distribution-of-the-hyperparameters&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.1&lt;/span&gt; The true distribution of the hyperparameters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#random-search&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.2&lt;/span&gt; random search&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bayesian-optimization-ucb&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.3&lt;/span&gt; bayesian optimization UCB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bayesian-optimization-pi&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.4&lt;/span&gt; bayesian optimization PI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bayesian-optimization-ei&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.5&lt;/span&gt; bayesian optimization EI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#contrast-the-results&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.6&lt;/span&gt; Contrast the results&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#deep-learning-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; deep learning model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#random-search-1&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5.1&lt;/span&gt; Random search&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bayesian-optimization-ucb-1&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5.2&lt;/span&gt; Bayesian optimization UCB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bayesian-optimization-pi-1&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5.3&lt;/span&gt; Bayesian optimization PI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bayesian-optimization-ei-1&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5.4&lt;/span&gt; Bayesian optimization EI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#contrast-the-results-1&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5.5&lt;/span&gt; Contrast the results&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-info&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Session info&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;Machine learning models are so called because of their ability to learn the parameter values that come as close as possible to those of the optimum objective function (or loss function). However, since all models require some assumptions (like linearity in linear regression models), parameters (like the cost C in svm models), and settings (like the number of layers in deep learning models) to be prespecified before training the model (in most cases they are set by default), the name &lt;strong&gt;machine learning&lt;/strong&gt; is not fully justified. These prespecified parameters are called &lt;strong&gt;hyperparameters&lt;/strong&gt;, and they should be defined in such a way that the corresponding model reaches its best performance (conditionally on the data at hand).&lt;/p&gt;
&lt;p&gt;The search for the best hyperparameters is called &lt;strong&gt;tuning&lt;/strong&gt;, which simply means training the model with many combinations of hyperparameter values, using techniques like cross-validation to make sure that the resulting loss very likely depends on that specific combination of values (and not on random chance), and then picking the combination that gives the optimum value of the objective function. This means that if our model requires a long computational time (due to limited hardware resources or a large dataset), we cannot try a large number of combinations, and hence we will likely end up far from the best result.&lt;/p&gt;
&lt;p&gt;The main problem, however, is how to define the space of hyperparameter values to choose from. Many methods are available:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;grid search&lt;/strong&gt;: with this method the modeler provides the values to evaluate for each hyperparameter and then picks the combination that gives the best result. However, this method is based entirely on the modeler’s experience with the model in question and the data at hand, so without enough experience, which is often the case, the choice is largely arbitrary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Random search&lt;/strong&gt;: with this method we choose the values for each hyperparameter at random, and it turns out that this method is sometimes more accurate than the previous one. However, it also suffers from the arbitrariness of the candidate values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bayesian hyperparameters&lt;/strong&gt;: this method uses Bayesian optimization to guide the search strategy toward the best hyperparameter values at minimum cost (the cost being the number of models trained). We will briefly discuss this method, but if you want more detail you can check the following great &lt;a href=&#34;https://distill.pub/2020/bayesian-optimization/&#34;&gt;article&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
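To make the contrast between grid search and random search concrete, here is a minimal, self-contained sketch in base R. The objective function, its optimum at (2, 3), and the search ranges are all illustrative assumptions standing in for a cross-validated metric:

```r
# Hypothetical objective: stands in for a cross-validated metric we maximize.
objective <- function(x, y) exp(-(x - 2)^2 - (y - 3)^2)

# Grid search: evaluate a fixed grid of candidate values.
grid <- expand.grid(x = 0:5, y = 0:5)
grid$score <- mapply(objective, grid$x, grid$y)
best_grid <- grid[which.max(grid$score), ]

# Random search: draw the same number of candidates uniformly at random.
set.seed(123)
rand <- data.frame(x = runif(nrow(grid), 0, 5), y = runif(nrow(grid), 0, 5))
rand$score <- mapply(objective, rand$x, rand$y)
best_rand <- rand[which.max(rand$score), ]
```

Both methods evaluate the same number of candidates, but neither uses past evaluations to guide the next one, which is exactly what Bayesian optimization adds.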
&lt;p&gt;We will focus on the best of these, &lt;strong&gt;Bayesian hyperparameters&lt;/strong&gt;, but first we briefly introduce the others.&lt;/p&gt;
&lt;p&gt;To understand these methods well, we will make use of a small dataset with a small number of predictors, and we will use two models: the machine learning model &lt;strong&gt;Random forest&lt;/strong&gt; and the deep learning model &lt;strong&gt;feedforward neural network&lt;/strong&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;bayesian-optimization&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Bayesian optimization&lt;/h1&gt;
&lt;p&gt;The main idea behind this method is very simple: at the first iteration we pick a point at random; then at each subsequent iteration, based on Bayes’ rule, we make a trade-off between choosing the point with the highest uncertainty (known as &lt;strong&gt;active learning&lt;/strong&gt;) and choosing a point within the region that has so far given the best result (optimum objective function). In more detail, say we are dealing with a maximization problem, such as maximizing the accuracy rate for a classification problem. At each iteration, the Bayesian optimization method must decide between focusing the search on the region containing the best point found so far (called &lt;strong&gt;exploitation&lt;/strong&gt;) and inspecting another region with the highest uncertainty (called &lt;strong&gt;exploration&lt;/strong&gt;). The question, however, is how the method decides between these two options. It uses what is called an &lt;strong&gt;acquisition function&lt;/strong&gt;, a function that helps decide which region to inspect in the next iteration.&lt;/p&gt;
&lt;div id=&#34;acquisition-functions&#34; class=&#34;section level2&#34; number=&#34;2.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.1&lt;/span&gt; Acquisition functions&lt;/h2&gt;
&lt;p&gt;Since this problem is always partially random, many forms of this function exist; we will briefly discuss the most common ones.&lt;/p&gt;
&lt;div id=&#34;upper-confidence-bound&#34; class=&#34;section level3&#34; number=&#34;2.1.1&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.1.1&lt;/span&gt; Upper confidence bound&lt;/h3&gt;
&lt;p&gt;Using this function, the next point selected will be the one with the highest upper confidence bound, and, assuming a Gaussian process, this bound is computed as follows:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[UCB(x)=\mu(x)+\kappa\sigma(x)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Where &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\sigma\)&lt;/span&gt; are the mean and the standard deviation defined by the Gaussian process, and &lt;span class=&#34;math inline&#34;&gt;\(\kappa\)&lt;/span&gt; is an exploration parameter: the larger its value, the more exploration.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;probability-of-improvement-pi&#34; class=&#34;section level3&#34; number=&#34;2.1.2&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.1.2&lt;/span&gt; Probability of improvement PI&lt;/h3&gt;
&lt;p&gt;This acquisition function chooses as the next point the one with the highest probability of improvement over the current maximum of the objective function &lt;span class=&#34;math inline&#34;&gt;\(f_{max}\)&lt;/span&gt; obtained from the previously evaluated points.&lt;/p&gt;
&lt;p&gt;Assuming the Gaussian process, the new point then will be the one that has the highest following probability:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[PI(x)=\Phi\left(\frac{\mu(x)-f_{max}-\varepsilon}{\sigma(x)}\right)\]&lt;/span&gt;
Where &lt;span class=&#34;math inline&#34;&gt;\(\varepsilon\)&lt;/span&gt; trades off exploration against exploitation, with larger values resulting in more exploration.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;expected-improvement-ei&#34; class=&#34;section level3&#34; number=&#34;2.1.3&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;2.1.3&lt;/span&gt; Expected improvement EI&lt;/h3&gt;
&lt;p&gt;This acquisition function, unlike the previous one, tries to quantify how much improvement we get from the new point; the point with the maximum expected improvement is chosen.&lt;/p&gt;
&lt;p&gt;Again, by assuming the Gaussian process, this function can be computed as follows:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[EI(x) = \left(\mu(x)-f_{max}\right)\Phi\left(\frac{\mu(x)-f_{max}-\varepsilon}{\sigma(x)}\right)+\sigma(x)\phi\left(\frac{\mu(x)-f_{max}-\varepsilon}{\sigma(x)}\right)\]&lt;/span&gt;&lt;/p&gt;
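The three acquisition functions above can be sketched in a few lines of base R. This is a minimal illustration assuming the Gaussian-process posterior mean mu and standard deviation sigma at a candidate point are already available; the function names and the default values of kappa and eps are arbitrary choices for demonstration:

```r
# Upper confidence bound: mu + kappa * sigma.
ucb <- function(mu, sigma, kappa = 2) mu + kappa * sigma

# Probability of improvement over the current best f_max.
pi_acq <- function(mu, sigma, f_max, eps = 0.01) {
  pnorm((mu - f_max - eps) / sigma)
}

# Expected improvement, following the formula above.
ei <- function(mu, sigma, f_max, eps = 0.01) {
  z <- (mu - f_max - eps) / sigma
  (mu - f_max) * pnorm(z) + sigma * dnorm(z)
}
```

In practice a package such as rBayesianOptimization evaluates one of these over the whole candidate space and proposes the maximizer as the next point to try.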
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;Let’s first call the packages needed along this article.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;options(warn=-1)
library(readr)
library(tidymodels)
library(themis)
library(plot3D)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For our illustration, we will use data downloaded from the link described in the script below. It is already split into a training set and a testing set. The target variable is a highly imbalanced binary variable, and the data has a very large number of missing values. To simplify our analysis, however, we will reduce the size of this data by first removing the variables with a large number of missing values, then removing the remaining rows that contain missing values, and lastly correcting the imbalanced distribution using downsampling. For more detail about this data, check my previous &lt;a href=&#34;https://modelingwithr.rbind.io/post/scania/predicting-large-and-imbalanced-data-set-using-the-r-package-tidymodels/&#34;&gt;article&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;options(warn=-1)
train &amp;lt;- read_csv(&amp;quot;https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_training_set.csv&amp;quot;, skip = 20)
test &amp;lt;- read_csv(&amp;quot;https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_test_set.csv&amp;quot;, skip = 20)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This data has 171 variables and a total of 76000 instances: 60000 in the training set and 16000 in the testing set. Notice that the missing values in the data are represented by lower case &lt;strong&gt;na&lt;/strong&gt; values rather than &lt;strong&gt;NA&lt;/strong&gt; and hence are not recognized by R as missing values; that is why the &lt;strong&gt;read_csv&lt;/strong&gt; function converted all the variables that contain them to the character type.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map_chr(train, typeof) %&amp;gt;% 
  tibble() %&amp;gt;% 
  table()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;.
character    double 
      170         1 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To fix this problem, we replace na with NA and then convert the corresponding variables back to the numeric type.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train[-1] &amp;lt;- train[-1] %&amp;gt;% 
  modify(~replace(., .==&amp;quot;na&amp;quot;, NA)) %&amp;gt;%
  modify(., as.double)

test[-1] &amp;lt;- test[-1] %&amp;gt;% 
  modify(~replace(., .==&amp;quot;na&amp;quot;, NA)) %&amp;gt;%
  modify(., as.double)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we keep only the predictors that have fewer than 600 missing values, and thereafter we keep only the rows without missing values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;names &amp;lt;- modify(train[-1], is.na) %&amp;gt;% 
  colSums() %&amp;gt;%
  tibble(names = colnames(train[-1]), missing_values=.) %&amp;gt;% 
  filter(missing_values &amp;lt; 600) %&amp;gt;% 
  select(1)
train1 &amp;lt;- train[c(&amp;quot;class&amp;quot;,names$names)] %&amp;gt;% 
  .[complete.cases(.),]
test1 &amp;lt;- test[c(&amp;quot;class&amp;quot;,names$names)] %&amp;gt;% 
  .[complete.cases(.),]
dim(train1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 58888    11&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dim(test1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 15728    11&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, the data has now been reduced to 11 variables. The last thing to check is the distribution of the target variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prop.table(table(train1$class))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
       neg        pos 
0.98376579 0.01623421 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, the data is highly imbalanced, so we correct this problem by downsampling using the &lt;strong&gt;themis&lt;/strong&gt; package. But first, we create a recipe that defines the formula of our model, normalizes the predictors, and downsamples the data. Then we execute the recipe on the data using the &lt;strong&gt;prep&lt;/strong&gt; function, and lastly we retrieve the transformed data with the &lt;strong&gt;juice&lt;/strong&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train2 &amp;lt;- recipe(class~., data=train1) %&amp;gt;%
  step_normalize(all_predictors()) %&amp;gt;% 
  step_downsample(class, seed = 111) %&amp;gt;%
  prep() %&amp;gt;% 
  juice()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the testing set, we transform it with the same recipe, but at the end we make use of the &lt;strong&gt;bake&lt;/strong&gt; function on the testing set instead of &lt;strong&gt;juice&lt;/strong&gt;, which uses the data defined in the recipe.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test2 &amp;lt;- recipe(class~., data=train1) %&amp;gt;%
  step_normalize(all_predictors()) %&amp;gt;%
  themis::step_downsample(class, seed = 111) %&amp;gt;% 
  prep() %&amp;gt;%
  bake(test1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It should be noted here that &lt;strong&gt;step_downsample&lt;/strong&gt; is not needed for the testing set; that is why &lt;strong&gt;bake&lt;/strong&gt; skips this step by default when applied to any new data. We can check this from the dimensions, which remain the same.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dim(test1); dim(test2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 15728    11&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 15728    11&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The testing set is not needed here, since our purpose is to compare hyperparameter tuning methods, which require only the training set. It is included only to show how it is processed within the tidymodels workflow.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;random-forest-model&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Random forest model&lt;/h1&gt;
&lt;p&gt;Since we are dealing with a classification problem, our objective function will be the area under the ROC curve &lt;strong&gt;roc_auc&lt;/strong&gt;. And for the model, we will use the most popular one, the &lt;strong&gt;Random forest&lt;/strong&gt; model, with two hyperparameters to tune:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;mtry&lt;/strong&gt;: The number of predictors sampled at each split.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;min_n&lt;/strong&gt;: The minimum number of instances in the node to split further.&lt;/li&gt;
&lt;/ul&gt;
&lt;div id=&#34;the-true-distribution-of-the-hyperparameters&#34; class=&#34;section level2&#34; number=&#34;4.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.1&lt;/span&gt; The true distribution of the hyperparameters&lt;/h2&gt;
&lt;p&gt;With this small data, this model is not computationally expensive, so we will try a wide range of values for the above hyperparameters and then use the results as the true distribution against which to compare the performance of the hyperparameter tuning methods discussed above.&lt;/p&gt;
&lt;p&gt;Using the &lt;strong&gt;tidymodels&lt;/strong&gt; package we define the model and leave &lt;code&gt;mtry&lt;/code&gt; and &lt;code&gt;min_n&lt;/code&gt; for tuning. To speed up computation, however, we restrict the number of trees to 100 instead of the default of 500.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_tune &amp;lt;- rand_forest(mtry = tune(), min_n = tune(), trees = 100L) %&amp;gt;% 
  set_engine(&amp;quot;ranger&amp;quot;, seed=222) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To avoid the pitfalls of a single random split, we use cross-validation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
folds &amp;lt;- vfold_cv(train2, v=5, strata = class) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we use the following workflow:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tune_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(model_tune) %&amp;gt;% 
  add_formula(class~.)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we are ready to train the model over a wide grid of combinations, which we then treat as the true distribution of our model. Note that we have 100 combinations, so if the model took hours to train, tuning would require days. To prevent the model from rerunning each time this document is rendered, we save the results in a csv file and load it again. If you want to run this model yourself, uncomment the script.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#tuned &amp;lt;- tune_wf %&amp;gt;% 
#   tune_grid(resamples = folds, 
#            grid =expand.grid(mtry=1:10, min_n=1:10),
#            metrics=metric_set(roc_auc))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To extract the results we use the &lt;strong&gt;collect_metrics&lt;/strong&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#df &amp;lt;- tuned %&amp;gt;% collect_metrics()
#write_csv(df, &amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df.csv&amp;quot;)
df &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df.csv&amp;quot;)
df &amp;lt;- df %&amp;gt;% arrange(-mean) %&amp;gt;% tibble(rank=seq_len(nrow(df)), .)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 100 x 8
    rank  mtry min_n .metric .estimator  mean     n std_err
   &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
 1     1     1     2 roc_auc binary     0.984     5 0.00198
 2     2     3     6 roc_auc binary     0.984     5 0.00204
 3     3     3     7 roc_auc binary     0.983     5 0.00269
 4     4     3     8 roc_auc binary     0.983     5 0.00304
 5     5     2    10 roc_auc binary     0.983     5 0.00200
 6     6     3     4 roc_auc binary     0.983     5 0.00246
 7     7     4     4 roc_auc binary     0.983     5 0.00224
 8     8     2     1 roc_auc binary     0.983     5 0.00170
 9     9     3     5 roc_auc binary     0.983     5 0.00271
10    10     4     5 roc_auc binary     0.983     5 0.00246
# ... with 90 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We reached the maximum value for the objective function &lt;strong&gt;0.9837565&lt;/strong&gt; with the following hyperparameter values &lt;code&gt;mtry = 1&lt;/code&gt;, and &lt;code&gt;min_n = 2&lt;/code&gt;.&lt;/p&gt;
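&lt;p&gt;If the tuning object were still in memory, the best combination could also be extracted directly with the &lt;strong&gt;select_best&lt;/strong&gt; function from the tune package instead of sorting the collected metrics by hand. This is only a sketch, kept commented like the chunks above since &lt;code&gt;tuned&lt;/code&gt; is not rerun when rendering:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# sketch: assumes the tuned object returned by tune_grid() is available
#best_combo &amp;lt;- tuned %&amp;gt;% select_best(&amp;quot;roc_auc&amp;quot;)
#best_combo   # one row with the mtry and min_n of the top roc_auc&lt;/code&gt;&lt;/pre&gt;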
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: we will ignore the overfitting problem, since we are only making comparisons on the same training set.&lt;/p&gt;
&lt;p&gt;We can plot this distribution as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;scatter3D(x=df$mtry, y=df$min_n, z=df$mean, phi = 0, bty = &amp;quot;g&amp;quot;,  type = &amp;quot;h&amp;quot;, 
          ticktype = &amp;quot;detailed&amp;quot;, pch = 19, cex = 0.5, 
          main=&amp;quot;the true distribution&amp;quot;, xlab=&amp;quot;mtry&amp;quot;,
          ylab=&amp;quot;min_n&amp;quot;, zlab=&amp;quot;roc_auc&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/bayes/hyper_bayes/2020-05-13-bayesian-hyperparameters-method_files/figure-html/unnamed-chunk-17-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;random-search&#34; class=&#34;section level2&#34; number=&#34;4.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.2&lt;/span&gt; Random search&lt;/h2&gt;
&lt;p&gt;Now suppose that we are allowed to try only 10 combinations. For the random search strategy, we use the &lt;strong&gt;grid_random&lt;/strong&gt; function from the &lt;strong&gt;dials&lt;/strong&gt; package (bundled with the tidymodels package).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#tuned_random &amp;lt;- tune_wf %&amp;gt;% 
#  tune_grid(resamples = folds, 
#            grid = grid_random(mtry(range = c(1,10)), min_n(range = c(1,10)),
#                               size = 10),
#            metrics=metric_set(roc_auc))
#df_r &amp;lt;- tuned_random %&amp;gt;% collect_metrics()

#write_csv(df_r, &amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_r.csv&amp;quot;)
df_r &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_r.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_r %&amp;gt;% arrange(-mean) %&amp;gt;% 
  head(1) %&amp;gt;% 
  inner_join(df, by=&amp;quot;mean&amp;quot;) %&amp;gt;%
  select(rank, mtry.x, min_n.x, mean)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 4
   rank mtry.x min_n.x  mean
  &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
1    13      3      10 0.983&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The maximum value obtained by this method is 0.9826900, which corresponds to the 13th rank of the true distribution, and the associated hyperparameter values are &lt;code&gt;mtry=3&lt;/code&gt; and &lt;code&gt;min_n=10&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;bayesian-optimization-ucb&#34; class=&#34;section level2&#34; number=&#34;4.3&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.3&lt;/span&gt; Bayesian optimization UCB&lt;/h2&gt;
&lt;p&gt;For the Bayesian optimization method, we make use of the &lt;strong&gt;tune_bayes&lt;/strong&gt; function from the &lt;strong&gt;tune&lt;/strong&gt; package, which provides all the acquisition functions discussed above. We start with the UCB function and set the argument &lt;strong&gt;kappa&lt;/strong&gt; equal to 2; this argument controls the trade-off between exploitation and exploration, with larger values leading to more exploration. Notice that, by default, this function explores 10 combinations; to try a different number, change the argument &lt;strong&gt;iter&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#tuned_UCB &amp;lt;- tune_wf %&amp;gt;% 
#  tune_bayes(resamples = folds,
#          param_info=parameters(mtry(range = c(1,10)), min_n(range = c(1,10))),
#             metrics=metric_set(roc_auc),
#             objective=conf_bound(kappa = 2))

#df_UCB &amp;lt;- tuned_UCB %&amp;gt;% collect_metrics()
#write_csv(df_UCB,&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_UCB.csv&amp;quot;)
df_UCB &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_UCB.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_UCB %&amp;gt;% arrange(-mean) %&amp;gt;% 
  head(1) %&amp;gt;% 
  inner_join(df, by=&amp;quot;mean&amp;quot;) %&amp;gt;%
  select(rank, mtry.x, min_n.x, mean)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 4
   rank mtry.x min_n.x  mean
  &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
1     2      3       6 0.984&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this acquisition function, we obtained a better result than random search, 0.9837529, which occupies the second position.&lt;/p&gt;
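&lt;p&gt;As mentioned above, the argument &lt;strong&gt;iter&lt;/strong&gt; controls how many combinations &lt;strong&gt;tune_bayes&lt;/strong&gt; explores. As a sketch of how the call would change to try, say, 20 combinations instead of the default 10 (kept commented like the chunks above, since rerunning is expensive):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# sketch: same UCB search as above but with 20 Bayesian iterations
#tuned_UCB20 &amp;lt;- tune_wf %&amp;gt;% 
#  tune_bayes(resamples = folds, iter = 20,
#             param_info = parameters(mtry(range = c(1,10)), min_n(range = c(1,10))),
#             metrics = metric_set(roc_auc),
#             objective = conf_bound(kappa = 2))&lt;/code&gt;&lt;/pre&gt;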
&lt;/div&gt;
&lt;div id=&#34;bayesian-optimization-pi&#34; class=&#34;section level2&#34; number=&#34;4.4&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.4&lt;/span&gt; Bayesian optimization PI&lt;/h2&gt;
&lt;p&gt;This time we explore the PI acquisition function discussed above. Note that larger values of the &lt;code&gt;trade_off&lt;/code&gt; argument lead to more exploration than exploitation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#tuned_PI &amp;lt;- tune_wf %&amp;gt;% 
#  tune_bayes(resamples = folds,
#         param_info=parameters(mtry(range = c(1,10)), min_n(range = c(1,10))),
#             metrics=metric_set(roc_auc),
#             objective=prob_improve(trade_off = 0.01))

#df_PI &amp;lt;- tuned_PI %&amp;gt;% collect_metrics()
#write_csv(df_PI, &amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_PI.csv&amp;quot;)
df_PI &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_PI.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_PI %&amp;gt;% arrange(-mean) %&amp;gt;% 
  head(1) %&amp;gt;% 
  inner_join(df, by=&amp;quot;mean&amp;quot;) %&amp;gt;%
  select(rank, mtry.x, min_n.x, mean)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 4
   rank mtry.x min_n.x  mean
  &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
1    11      2       9 0.983&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, with this acquisition function we obtained the 11th position, which is worse than the previous one but still better than random search.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;bayesian-optimization-ei&#34; class=&#34;section level2&#34; number=&#34;4.5&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.5&lt;/span&gt; Bayesian optimization EI&lt;/h2&gt;
&lt;p&gt;Now we try another acquisition function, the expected improvement function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#tuned_EI &amp;lt;- tune_wf %&amp;gt;% 
#  tune_bayes(resamples = folds,
#             param_info=parameters(mtry(range = c(1,10)), 
#             min_n(range = c(1,10))),
#             metrics=metric_set(roc_auc),
#             objective=exp_improve(trade_off = 0.01))

#df_EI &amp;lt;- tuned_EI %&amp;gt;% collect_metrics()

#write_csv(df_EI, &amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_EI.csv&amp;quot;)
df_EI &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_EI.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_EI %&amp;gt;% arrange(-mean) %&amp;gt;% 
  head(1) %&amp;gt;% 
  inner_join(df, by=&amp;quot;mean&amp;quot;) %&amp;gt;%
  select(rank, mtry.x, min_n.x, mean)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 4
   rank mtry.x min_n.x  mean
  &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
1     2      3       6 0.984&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here we get the same result as with the UCB method.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;contrast-the-results&#34; class=&#34;section level2&#34; number=&#34;4.6&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.6&lt;/span&gt; Contrast the results&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_rf &amp;lt;- tibble(names=c(&amp;quot;true&amp;quot;, &amp;quot;random&amp;quot;, &amp;quot;UCB&amp;quot;, &amp;quot;PI&amp;quot;, &amp;quot;EI&amp;quot;),
                mtry=c(df[df$mean==max(df$mean),1, drop=TRUE],
                  df_r[df_r$mean==max(df_r$mean),1, drop=TRUE],
                     df_UCB[df_UCB$mean==max(df_UCB$mean),1, drop=TRUE],
                     df_PI[df_PI$mean==max(df_PI$mean),1, drop=TRUE],
                     df_EI[df_EI$mean==max(df_EI$mean),1, drop=TRUE]),
             min_n=c(df[df$mean==max(df$mean),2, drop=TRUE],
               df_r[df_r$mean==max(df_r$mean),2, drop=TRUE],
                     df_UCB[df_UCB$mean==max(df_UCB$mean),2, drop=TRUE],
                     df_PI[df_PI$mean==max(df_PI$mean),2, drop=TRUE],
                     df_EI[df_EI$mean==max(df_EI$mean),2, drop=TRUE]),
             roc_auc=c(max(df$mean), max(df_r$mean),max(df_UCB$mean),
                       max(df_PI$mean),max(df_EI$mean)),
             std_err=c(df[df$mean==max(df$mean),ncol(df), drop=TRUE],
                    df_r[df_r$mean==max(df_r$mean),ncol(df_r), drop=TRUE],
                     df_UCB[df_UCB$mean==max(df_UCB$mean),ncol(df_UCB), drop=TRUE],
                     df_PI[df_PI$mean==max(df_PI$mean),ncol(df_PI), drop=TRUE],
                     df_EI[df_EI$mean==max(df_EI$mean),ncol(df_EI), drop=TRUE]))
df_rf %&amp;gt;% arrange(-roc_auc) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 5 x 5
  names   mtry min_n roc_auc std_err
  &amp;lt;chr&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
1 true       1     1   0.984 0.00198
2 UCB        3     6   0.984 0.00204
3 EI         3     6   0.984 0.00204
4 PI         2     9   0.983 0.00290
5 random     3    10   0.983 0.00275&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, the Bayesian optimization method performs better than random search whichever acquisition function is used. However, since the acquisition functions have their own hyperparameters (to trade off exploration and exploitation), their performance may differ strongly from one another. Moreover, the difference can be larger on a bigger dataset.&lt;/p&gt;
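&lt;p&gt;As an aside, the summary table above can be built more compactly with dplyr and purrr (both attached via tidymodels). The following sketch, assuming the five tibbles &lt;code&gt;df&lt;/code&gt;, &lt;code&gt;df_r&lt;/code&gt;, &lt;code&gt;df_UCB&lt;/code&gt;, &lt;code&gt;df_PI&lt;/code&gt;, and &lt;code&gt;df_EI&lt;/code&gt; are in memory, keeps the row with the highest mean from each one:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# sketch: one best row per method, bound into a single tibble
df_best &amp;lt;- list(true = df, random = df_r, UCB = df_UCB, PI = df_PI, EI = df_EI) %&amp;gt;% 
  map_dfr(~ slice_max(.x, mean, n = 1, with_ties = FALSE), .id = &amp;quot;names&amp;quot;) %&amp;gt;% 
  select(names, mtry, min_n, roc_auc = mean, std_err) %&amp;gt;% 
  arrange(-roc_auc)&lt;/code&gt;&lt;/pre&gt;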
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;deep-learning-model&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Deep learning model&lt;/h1&gt;
&lt;p&gt;As we did with the random forest model, we explore a wide range of hyperparameter values, assume these values represent the true distribution, and then apply all the tuning methods above. For the architecture, we use a deep learning model with a single hidden layer, and we tune two hyperparameters: the number of nodes and the number of epochs. Finally, we save what we need in a csv file and load it again, since each rerun of the model gives a different result and, besides, it takes a lot of time.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_tune1 &amp;lt;- mlp(hidden_units = tune(), epochs = tune()) %&amp;gt;% 
  set_engine(&amp;quot;keras&amp;quot;) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;)

tune_wf1 &amp;lt;- workflow() %&amp;gt;% 
  add_model(model_tune1) %&amp;gt;% 
  add_formula(class~.)

#tuned_deepl &amp;lt;- tune_wf1 %&amp;gt;% 
#  tune_grid(resamples = folds, 
#            grid = expand.grid(hidden_units=5:15, epochs=10:20),
#            metrics=metric_set(roc_auc))

#df1 &amp;lt;- tuned_deepl %&amp;gt;% collect_metrics()


#write_csv(df1, &amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df1.csv&amp;quot;)

df1 &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df1.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df1 %&amp;gt;% arrange(-mean)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 121 x 7
   hidden_units epochs .metric .estimator  mean     n std_err
          &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
 1           12     18 roc_auc binary     0.982     5 0.00255
 2           11     20 roc_auc binary     0.982     5 0.00265
 3            9     14 roc_auc binary     0.982     5 0.00274
 4            8     16 roc_auc binary     0.982     5 0.00207
 5           14     18 roc_auc binary     0.982     5 0.00241
 6           14     20 roc_auc binary     0.982     5 0.00244
 7           12     17 roc_auc binary     0.982     5 0.00265
 8           10     16 roc_auc binary     0.982     5 0.00230
 9           10     19 roc_auc binary     0.981     5 0.00231
10           12     20 roc_auc binary     0.981     5 0.00227
# ... with 111 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With 12 units in the hidden layer and 18 epochs, we reached the maximum value of 0.9819806 for the area under the ROC curve.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;scatter3D(x=df1$hidden_units, y=df1$epochs, z=df1$mean, 
          phi = 0, bty = &amp;quot;g&amp;quot;,  type = &amp;quot;h&amp;quot;, 
          ticktype = &amp;quot;detailed&amp;quot;, pch = 19, cex = 0.5, 
          main=&amp;quot;the true distribution&amp;quot;, xlab=&amp;quot;units&amp;quot;,
          ylab=&amp;quot;epochs&amp;quot;, zlab=&amp;quot;roc_auc&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/bayes/hyper_bayes/2020-05-13-bayesian-hyperparameters-method_files/figure-html/unnamed-chunk-29-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;random-search-1&#34; class=&#34;section level2&#34; number=&#34;5.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;5.1&lt;/span&gt; Random search&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#tuned_random1 &amp;lt;- tune_wf1 %&amp;gt;% 
#  tune_grid(resamples = folds, 
#            grid = grid_random(hidden_units(range = c(5,15)),
#                               epochs(range = c(10,20)), size = 10),
#            metrics=metric_set(roc_auc))
#df_r1 &amp;lt;- tuned_random1 %&amp;gt;% collect_metrics()

#write_csv(df_r1, &amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_r1.csv&amp;quot;)
df_r1 &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_r1.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_r1 %&amp;gt;% arrange(-mean) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 10 x 7
   hidden_units epochs .metric .estimator  mean     n std_err
          &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
 1           14     11 roc_auc binary     0.982     5 0.00230
 2           14     17 roc_auc binary     0.981     5 0.00240
 3            8     15 roc_auc binary     0.981     5 0.00266
 4           11     19 roc_auc binary     0.981     5 0.00268
 5           14     16 roc_auc binary     0.981     5 0.00221
 6           10     12 roc_auc binary     0.980     5 0.00259
 7            7     13 roc_auc binary     0.980     5 0.00229
 8            5     16 roc_auc binary     0.980     5 0.00241
 9            9     10 roc_auc binary     0.980     5 0.00244
10            5     13 roc_auc binary     0.979     5 0.00270&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Deep learning models strongly depend on an internal random process (such as the weight initialization), and the differences between results are small because of the small size of the data, so it is harder to contrast the methods. To alleviate this problem, we will this time use the standard errors to check whether the differences between methods are significant.&lt;/p&gt;
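&lt;p&gt;One rough way to see this is to attach an approximate 95% interval (about two standard errors on either side of the mean) to each candidate. The sketch below assumes the &lt;code&gt;df_r1&lt;/code&gt; tibble loaded above is in memory:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# sketch: approximate 95% intervals around each cross-validated roc_auc
df_r1 %&amp;gt;% 
  arrange(-mean) %&amp;gt;% 
  mutate(lower = mean - 2 * std_err,
         upper = mean + 2 * std_err) %&amp;gt;% 
  select(hidden_units, epochs, mean, lower, upper)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When these intervals overlap heavily, the apparent ranking of the configurations should not be over-interpreted.&lt;/p&gt;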
&lt;/div&gt;
&lt;div id=&#34;bayesian-optimization-ucb-1&#34; class=&#34;section level2&#34; number=&#34;5.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;5.2&lt;/span&gt; Bayesian optimization UCB&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#tuned_UCB1 &amp;lt;- tune_wf1 %&amp;gt;% 
#  tune_bayes(resamples = folds,
#             param_info=parameters(hidden_units(range = c(5L,15L)), 
#             epochs(range = c(10L,20L))),
#             metrics=metric_set(roc_auc),
#             objective=conf_bound(kappa = 2))

#df_UCB1 &amp;lt;- tuned_UCB1 %&amp;gt;% collect_metrics()

#write_csv(df_UCB1, &amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_UCB1.csv&amp;quot;)
df_UCB1 &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_UCB1.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_UCB1 %&amp;gt;% arrange(-mean)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 15 x 8
   hidden_units epochs .iter .metric .estimator  mean     n std_err
          &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
 1           12     19     8 roc_auc binary     0.981     5 0.00237
 2           12     17     0 roc_auc binary     0.981     5 0.00250
 3            9     19    10 roc_auc binary     0.981     5 0.00203
 4           12     18     2 roc_auc binary     0.981     5 0.00236
 5           11     17     3 roc_auc binary     0.981     5 0.00248
 6           13     19     9 roc_auc binary     0.981     5 0.00246
 7           13     18     6 roc_auc binary     0.981     5 0.00260
 8           13     17     1 roc_auc binary     0.981     5 0.00254
 9           10     19     0 roc_auc binary     0.981     5 0.00230
10           11     19     7 roc_auc binary     0.981     5 0.00221
11           13     15     0 roc_auc binary     0.981     5 0.00257
12           11     18     5 roc_auc binary     0.981     5 0.00269
13           12     16     4 roc_auc binary     0.981     5 0.00213
14            5     11     0 roc_auc binary     0.979     5 0.00233
15            9     12     0 roc_auc binary     0.979     5 0.00288&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;bayesian-optimization-pi-1&#34; class=&#34;section level2&#34; number=&#34;5.3&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;5.3&lt;/span&gt; Bayesian optimization PI&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#tuned_PI1 &amp;lt;- tune_wf1 %&amp;gt;% 
#  tune_bayes(resamples = folds,
#             param_info=parameters(hidden_units(range = c(5L,15L)), 
#             epochs(range = c(10L,20L))),
#             metrics=metric_set(roc_auc),
#             objective=prob_improve(trade_off = 0.01))

#df_PI1 &amp;lt;- tuned_PI1 %&amp;gt;% collect_metrics()

#write_csv(df_PI1, &amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_PI1.csv&amp;quot;)
df_PI1 &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_PI1.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_PI1 %&amp;gt;% arrange(-mean)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 15 x 8
   hidden_units epochs .iter .metric .estimator  mean     n std_err
          &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
 1            7     20     6 roc_auc binary     0.981     5 0.00236
 2           10     10     5 roc_auc binary     0.981     5 0.00237
 3            9     17     0 roc_auc binary     0.981     5 0.00216
 4           15     16     8 roc_auc binary     0.981     5 0.00213
 5           14     20     0 roc_auc binary     0.981     5 0.00251
 6           11     20     4 roc_auc binary     0.981     5 0.00282
 7           15     10     3 roc_auc binary     0.981     5 0.00272
 8            6     10     0 roc_auc binary     0.981     5 0.00286
 9           13     20    10 roc_auc binary     0.981     5 0.00253
10           12     14     0 roc_auc binary     0.981     5 0.00268
11            8     12     0 roc_auc binary     0.980     5 0.00220
12           13     10     7 roc_auc binary     0.980     5 0.00273
13            8     10     1 roc_auc binary     0.980     5 0.00305
14            5     20     2 roc_auc binary     0.980     5 0.00227
15            5     12     9 roc_auc binary     0.979     5 0.00234&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;bayesian-optimization-ei-1&#34; class=&#34;section level2&#34; number=&#34;5.4&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;5.4&lt;/span&gt; Bayesian optimization EI&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#tuned_EI1 &amp;lt;- tune_wf1 %&amp;gt;% 
#  tune_bayes(resamples = folds,
#             param_info=parameters(hidden_units(range = c(5L,15L)), 
#             epochs(range = c(10L,20L))),
#             metrics=metric_set(roc_auc),
#             objective=exp_improve(trade_off = 0.01))

#df_EI1 &amp;lt;- tuned_EI1 %&amp;gt;% collect_metrics()

#write_csv(df_EI1, &amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_EI1.csv&amp;quot;)
df_EI1 &amp;lt;- read_csv(&amp;quot;C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_EI1.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_EI1 %&amp;gt;% arrange(-mean)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 15 x 8
   hidden_units epochs .iter .metric .estimator  mean     n std_err
          &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
 1           10     20     0 roc_auc binary     0.981     5 0.00229
 2           15     18     1 roc_auc binary     0.981     5 0.00263
 3            7     16     0 roc_auc binary     0.981     5 0.00286
 4           12     20     8 roc_auc binary     0.981     5 0.00276
 5            8     19     6 roc_auc binary     0.981     5 0.00231
 6           14     14     0 roc_auc binary     0.981     5 0.00229
 7           13     14     7 roc_auc binary     0.981     5 0.00253
 8            5     20    10 roc_auc binary     0.981     5 0.00205
 9            6     13     5 roc_auc binary     0.981     5 0.00244
10           12     10     0 roc_auc binary     0.980     5 0.00266
11           11     14     4 roc_auc binary     0.980     5 0.00191
12           10     10     9 roc_auc binary     0.980     5 0.00284
13            5     12     0 roc_auc binary     0.980     5 0.00274
14            5     10     2 roc_auc binary     0.980     5 0.00261
15            9     14     3 roc_auc binary     0.979     5 0.00243&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;contrast-the-results-1&#34; class=&#34;section level2&#34; number=&#34;5.5&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;5.5&lt;/span&gt; Contrast the results&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_deep &amp;lt;- tibble(names=c(&amp;quot;true&amp;quot;, &amp;quot;random&amp;quot;, &amp;quot;UCB&amp;quot;, &amp;quot;PI&amp;quot;, &amp;quot;EI&amp;quot;),
                  hidden_units=c(df1[df1$mean==max(df1$mean),1, drop=TRUE],
                    df_r1[df_r1$mean==max(df_r1$mean),1, drop=TRUE],
                     df_UCB1[df_UCB1$mean==max(df_UCB1$mean),1, drop=TRUE],
                     df_PI1[df_PI1$mean==max(df_PI1$mean),1, drop=TRUE],
                     df_EI1[df_EI1$mean==max(df_EI1$mean),1, drop=TRUE]),
                  epochs=c(df1[df1$mean==max(df1$mean),2, drop=TRUE],
                    df_r1[df_r1$mean==max(df_r1$mean),2, drop=TRUE],
                     df_UCB1[df_UCB1$mean==max(df_UCB1$mean),2, drop=TRUE],
                     df_PI1[df_PI1$mean==max(df_PI1$mean),2, drop=TRUE],
                     df_EI1[df_EI1$mean==max(df_EI1$mean),2, drop=TRUE]),
                  roc_auc=c(max(df1$mean), max(df_r1$mean), max(df_UCB1$mean),
                       max(df_PI1$mean),max(df_EI1$mean)),
                  std_err=c(df1[df1$mean==max(df1$mean),ncol(df1), drop=TRUE],
                    df_r1[df_r1$mean==max(df_r1$mean),ncol(df_r1), drop=TRUE],
                     df_UCB1[df_UCB1$mean==max(df_UCB1$mean),ncol(df_UCB1), drop=TRUE],
                     df_PI1[df_PI1$mean==max(df_PI1$mean),ncol(df_PI1), drop=TRUE],
                     df_EI1[df_EI1$mean==max(df_EI1$mean),ncol(df_EI1), drop=TRUE]))
             
df_deep &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 5 x 5
  names  hidden_units epochs roc_auc std_err
  &amp;lt;chr&amp;gt;         &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
1 true             12     18   0.982 0.00255
2 random           14     11   0.982 0.00230
3 UCB              12     19   0.981 0.00237
4 PI                7     20   0.981 0.00236
5 EI               10     20   0.981 0.00229&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(df_deep[-1, ]%&amp;gt;% arrange(-roc_auc), aes(x=names, y=roc_auc))+
         geom_point(size=2, color= &amp;quot;red&amp;quot;)+
         geom_hline(yintercept = df_deep[df_deep$names==&amp;quot;true&amp;quot;, 4, drop=TRUE], 
                    color= &amp;quot;blue&amp;quot;, lty=&amp;quot;dashed&amp;quot;)+
         geom_errorbar(aes(ymin=roc_auc-1.92*std_err, ymax=roc_auc+1.92*std_err))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/bayes/hyper_bayes/2020-05-13-bayesian-hyperparameters-method_files/figure-html/unnamed-chunk-39-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As we see, we do not obtain any significant difference between these methods, which is essentially due to the small size of the data and the narrow ranges used for the hyperparameter values. With a larger and more complex dataset, however, the difference should be more apparent.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Conclusion&lt;/h1&gt;
&lt;p&gt;The main purpose of this article is to show how to implement these methods in practice rather than to highlight the performance differences between them. Bayesian optimization is known as the most efficient method, but it has the downside of requiring its own hyperparameters, such as the choice of acquisition function, which are sometimes set arbitrarily. In other words, a limited budget that should be concentrated on searching for the optimum cannot be wasted on trying different acquisition functions.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;session-info&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Session info&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] plot3D_1.3       themis_0.1.2     yardstick_0.0.7  workflows_0.2.0 
 [5] tune_0.1.1       tidyr_1.1.2      tibble_3.0.3     rsample_0.0.8   
 [9] recipes_0.1.13   purrr_0.3.4      parsnip_0.1.3    modeldata_0.0.2 
[13] infer_0.5.3      ggplot2_3.3.2    dplyr_1.0.2      dials_0.0.9     
[17] scales_1.1.1     broom_0.7.1      tidymodels_0.1.1 readr_1.3.1     

loaded via a namespace (and not attached):
 [1] lubridate_1.7.9    doParallel_1.0.15  DiceDesign_1.8-1   tools_4.0.1       
 [5] backports_1.1.10   utf8_1.1.4         R6_2.4.1           rpart_4.1-15      
 [9] colorspace_1.4-1   nnet_7.3-14        withr_2.3.0        tidyselect_1.1.0  
[13] compiler_4.0.1     parallelMap_1.5.0  cli_2.0.2          labeling_0.3      
[17] bookdown_0.20      checkmate_2.0.0    stringr_1.4.0      digest_0.6.25     
[21] rmarkdown_2.4      unbalanced_2.0     pkgconfig_2.0.3    htmltools_0.5.0   
[25] lhs_1.1.0          rlang_0.4.7        rstudioapi_0.11    BBmisc_1.11       
[29] FNN_1.1.3          farver_2.0.3       generics_0.0.2     magrittr_1.5      
[33] ROSE_0.0-3         Matrix_1.2-18      Rcpp_1.0.5         munsell_0.5.0     
[37] fansi_0.4.1        GPfit_1.0-8        lifecycle_0.2.0    furrr_0.1.0       
[41] stringi_1.5.3      pROC_1.16.2        yaml_2.2.1         MASS_7.3-53       
[45] plyr_1.8.6         misc3d_0.9-0       grid_4.0.1         parallel_4.0.1    
[49] listenv_0.8.0      crayon_1.3.4       lattice_0.20-41    splines_4.0.1     
[53] hms_0.5.3          knitr_1.30         mlr_2.17.1         pillar_1.4.6      
[57] tcltk_4.0.1        codetools_0.2-16   fastmatch_1.1-0    glue_1.4.2        
[61] evaluate_0.14      ParamHelpers_1.14  blogdown_0.20      data.table_1.13.0 
[65] vctrs_0.3.4        foreach_1.5.0      gtable_0.3.0       RANN_2.6.1        
[69] future_1.19.1      assertthat_0.2.1   xfun_0.18          gower_0.2.2       
[73] prodlim_2019.11.13 class_7.3-17       survival_3.2-7     timeDate_3043.102 
[77] iterators_1.0.12   lava_1.6.8         globals_0.13.0     ellipsis_0.3.1    
[81] ipred_0.9-9       &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Deep learning model for Titanic data</title>
      <link>https://modelingwithr.rbind.io/courses/deep-learning-model-for-titanic-data/</link>
      <pubDate>Wed, 13 May 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/courses/deep-learning-model-for-titanic-data/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#partition-the-data-impute-the-missing-values.&#34;&gt;Partition the data &amp;amp; impute the missing values.&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#convert-the-data-into-a-numeric-matrix.&#34;&gt;Convert the data into a numeric matrix.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#train-the-model.&#34;&gt;Train the model.&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#create-the-model&#34;&gt;Create the model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#compile-the-model&#34;&gt;Compile the model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#fit-the-model&#34;&gt;Fit the model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-model-evaluation&#34;&gt;The model evaluation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-tuning&#34;&gt;Model tuning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;div id=&#34;introduction&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Deep learning models belong to the family of machine learning models that can be used for either supervised or unsupervised learning. Based on the &lt;a href=&#34;https://www.digitaltrends.com/cool-tech/what-is-an-artificial-neural-network/&#34;&gt;artificial neural network&lt;/a&gt;, they can handle a wide variety of data types by using different neural network architectures, such as the &lt;a href=&#34;https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks&#34;&gt;recurrent neural network RNN&lt;/a&gt; for sequence data (time series, text data, etc.), the &lt;a href=&#34;https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53&#34;&gt;convolutional neural network CNN&lt;/a&gt; for computer vision, the &lt;a href=&#34;https://machinelearningmastery.com/what-are-generative-adversarial-networks-gans/&#34;&gt;generative adversarial network GAN&lt;/a&gt; for image generation, and many other types of architecture.
The basic architecture of a deep learning model is the same as that of the classical artificial neural network (which has one hidden layer), with the difference that deep learning allows more than one hidden layer (this is where the name deep comes from). These layers are called dense layers since each node of a particular layer is connected to all the nodes of the previous layer, and in addition each node has an &lt;a href=&#34;https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/&#34;&gt;activation function&lt;/a&gt; to capture any nonlinearity in the data.&lt;/p&gt;
&lt;p&gt;In this article, we will use a basic deep learning model to predict survival on the famous Titanic data set (from the Kaggle competition).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data preparation&lt;/h2&gt;
&lt;p&gt;We use the Titanic data because it is familiar to almost everyone, which lets us focus on understanding and implementing the model. So let’s load the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh &amp;lt;- suppressPackageStartupMessages
ssh(library(tidyverse))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;ggplot2&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;tibble&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;tidyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;dplyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data &amp;lt;- read_csv(&amp;quot;C://Users/dell/Documents/new-blog/content/post/train.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Parsed with column specification:
## cols(
##   PassengerId = col_double(),
##   Survived = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Ticket = col_character(),
##   Fare = col_double(),
##   Cabin = col_character(),
##   Embarked = col_character()
## )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we load the &lt;strong&gt;keras&lt;/strong&gt; package for deep learning models, and &lt;strong&gt;caret&lt;/strong&gt; for randomly splitting the data and creating the confusion matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh(library(keras))
ssh(library(caret))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first step in modeling is to clean and prepare the data. The following code shows the structure of the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;glimpse(data)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 891
## Columns: 12
## $ PassengerId &amp;lt;dbl&amp;gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Survived    &amp;lt;dbl&amp;gt; 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
## $ Pclass      &amp;lt;dbl&amp;gt; 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
## $ Name        &amp;lt;chr&amp;gt; &amp;quot;Braund, Mr. Owen Harris&amp;quot;, &amp;quot;Cumings, Mrs. John Bradley ...
## $ Sex         &amp;lt;chr&amp;gt; &amp;quot;male&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;male&amp;quot;, &amp;quot;male&amp;quot;, &amp;quot;...
## $ Age         &amp;lt;dbl&amp;gt; 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
## $ SibSp       &amp;lt;dbl&amp;gt; 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
## $ Parch       &amp;lt;dbl&amp;gt; 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
## $ Ticket      &amp;lt;chr&amp;gt; &amp;quot;A/5 21171&amp;quot;, &amp;quot;PC 17599&amp;quot;, &amp;quot;STON/O2. 3101282&amp;quot;, &amp;quot;113803&amp;quot;, ...
## $ Fare        &amp;lt;dbl&amp;gt; 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
## $ Cabin       &amp;lt;chr&amp;gt; NA, &amp;quot;C85&amp;quot;, NA, &amp;quot;C123&amp;quot;, NA, NA, &amp;quot;E46&amp;quot;, NA, NA, NA, &amp;quot;G6&amp;quot;,...
## $ Embarked    &amp;lt;chr&amp;gt; &amp;quot;S&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;Q&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;S&amp;quot;, ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using this data we want to predict the variable &lt;strong&gt;Survived&lt;/strong&gt; using the remaining variables as predictors. We see that some variables have a unique value for every row, such as &lt;strong&gt;PassengerId&lt;/strong&gt;, &lt;strong&gt;Name&lt;/strong&gt;, and &lt;strong&gt;Ticket&lt;/strong&gt;, so they cannot be used as predictors. The same note applies to the variable &lt;strong&gt;Cabin&lt;/strong&gt;, with the additional problem of missing values. These variables will be removed as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-data[,-c(1,4,9,11)]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Some variables should be of factor type, such as &lt;strong&gt;Pclass&lt;/strong&gt; (which is now double), &lt;strong&gt;Sex&lt;/strong&gt; (character), and &lt;strong&gt;Embarked&lt;/strong&gt; (character). Thus, we convert them to factor type:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata &amp;lt;- mydata %&amp;gt;%  modify_at(c(&amp;#39;Pclass&amp;#39;, &amp;#39;Embarked&amp;#39;, &amp;#39;Sex&amp;#39; ), as.factor)
glimpse(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 891
## Columns: 8
## $ Survived &amp;lt;dbl&amp;gt; 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1...
## $ Pclass   &amp;lt;fct&amp;gt; 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3...
## $ Sex      &amp;lt;fct&amp;gt; male, female, female, female, male, male, male, male, fema...
## $ Age      &amp;lt;dbl&amp;gt; 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ...
## $ SibSp    &amp;lt;dbl&amp;gt; 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0...
## $ Parch    &amp;lt;dbl&amp;gt; 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0...
## $ Fare     &amp;lt;dbl&amp;gt; 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,...
## $ Embarked &amp;lt;fct&amp;gt; S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s get a summary of this data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     Survived      Pclass      Sex           Age            SibSp      
##  Min.   :0.0000   1:216   female:314   Min.   : 0.42   Min.   :0.000  
##  1st Qu.:0.0000   2:184   male  :577   1st Qu.:20.12   1st Qu.:0.000  
##  Median :0.0000   3:491                Median :28.00   Median :0.000  
##  Mean   :0.3838                        Mean   :29.70   Mean   :0.523  
##  3rd Qu.:1.0000                        3rd Qu.:38.00   3rd Qu.:1.000  
##  Max.   :1.0000                        Max.   :80.00   Max.   :8.000  
##                                        NA&amp;#39;s   :177                    
##      Parch             Fare        Embarked  
##  Min.   :0.0000   Min.   :  0.00   C   :168  
##  1st Qu.:0.0000   1st Qu.:  7.91   Q   : 77  
##  Median :0.0000   Median : 14.45   S   :644  
##  Mean   :0.3816   Mean   : 32.20   NA&amp;#39;s:  2  
##  3rd Qu.:0.0000   3rd Qu.: 31.00             
##  Max.   :6.0000   Max.   :512.33             
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Two variables have missing values: &lt;strong&gt;Age&lt;/strong&gt; with a large number of them (177), followed by &lt;strong&gt;Embarked&lt;/strong&gt; with only 2.
To deal with this issue we have two options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The first and easier option is to remove every row that has any missing value, at the cost of possibly losing valuable information, especially when the number of missing values is large compared to the total number of observations, as in our case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The second option is to impute the missing values from the complete cases; for instance, we can replace a missing value in a numeric column by the mean of that column, or use a multinomial model to predict missing categorical values.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
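&lt;p&gt;As a minimal sketch of the second option, a numeric column could be mean-imputed by hand. This is only an illustration of the idea; below we use the &lt;strong&gt;mice&lt;/strong&gt; package instead:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# illustration only: replace the missing Age values by the mean of the observed ones
mean_age &amp;lt;- mean(mydata$Age, na.rm = TRUE)
age_imputed &amp;lt;- ifelse(is.na(mydata$Age), mean_age, mydata$Age)&lt;/code&gt;&lt;/pre&gt;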
&lt;p&gt;Fortunately, there is a useful package called &lt;strong&gt;mice&lt;/strong&gt; which will do this imputation for us. However, applying the imputation to the entire data set would lead to a problem called &lt;strong&gt;train-test contamination&lt;/strong&gt;: when we split the data, the missing values of the training set would have been imputed using cases from the test set, which violates a crucial concept in machine learning model evaluation, namely that the test set should never be seen by the model during the training process.&lt;/p&gt;
&lt;p&gt;To avoid this problem we apply the imputation separately to the training set and to the testing set.
So let’s partition the data using the &lt;strong&gt;caret&lt;/strong&gt; package.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;partition-the-data-impute-the-missing-values.&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Partition the data &amp;amp; impute the missing values.&lt;/h2&gt;
&lt;p&gt;We randomly split the data into two sets: 80% of the samples will be used in the training process and the remaining 20% will be kept as the test set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
index&amp;lt;-createDataPartition(mydata$Survived,p=0.8,list=FALSE)
train&amp;lt;-mydata[index,]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: The `i` argument of ``[`()` can&amp;#39;t be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test&amp;lt;-mydata[-index,]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we are ready to impute the missing values for both the train and test sets.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh(library(mice))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;mice&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;impute_train&amp;lt;-mice(train,m=1,seed = 1111)
train&amp;lt;-complete(impute_train,1)

impute_test&amp;lt;-mice(test,m=1,seed = 1111)
test&amp;lt;-complete(impute_test,1)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;convert-the-data-into-a-numeric-matrix.&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Convert the data into a numeric matrix.&lt;/h3&gt;
&lt;p&gt;In deep learning all the variables should be of numeric type, so we first convert the factors to integer type and recode the levels to start from 0, then we convert the data into a matrix, and finally we pull out the target variable into a separate vector.
We do this transformation for both sets (train and test).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train$Embarked&amp;lt;-as.integer(train$Embarked)-1
train$Sex&amp;lt;-as.integer(train$Sex)-1
train$Pclass&amp;lt;-as.integer(train$Pclass)-1

test$Embarked&amp;lt;-as.integer(test$Embarked)-1
test$Sex&amp;lt;-as.integer(test$Sex)-1
test$Pclass&amp;lt;-as.integer(test$Pclass)-1
glimpse(test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 178
## Columns: 8
## $ Survived &amp;lt;dbl&amp;gt; 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0...
## $ Pclass   &amp;lt;dbl&amp;gt; 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 0, 0, 2, 2, 2, 1, 2, 2, 2...
## $ Sex      &amp;lt;dbl&amp;gt; 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1...
## $ Age      &amp;lt;dbl&amp;gt; 35.0, 2.0, 27.0, 55.0, 38.0, 23.0, 38.0, 3.0, 28.0, 34.5, ...
## $ SibSp    &amp;lt;dbl&amp;gt; 0, 3, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 4, 5, 0, 0, 0, 0...
## $ Parch    &amp;lt;dbl&amp;gt; 0, 1, 2, 0, 0, 0, 5, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0...
## $ Fare     &amp;lt;dbl&amp;gt; 8.0500, 21.0750, 11.1333, 16.0000, 13.0000, 7.2250, 31.387...
## $ Embarked &amp;lt;dbl&amp;gt; 2, 2, 2, 2, 2, 0, 2, 0, 2, 1, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: We converted the variables &lt;strong&gt;Pclass&lt;/strong&gt;, &lt;strong&gt;Embarked&lt;/strong&gt;, and &lt;strong&gt;Sex&lt;/strong&gt; to factors so that the imputation step would treat them appropriately; had we not done so, the imputed values of &lt;strong&gt;Embarked&lt;/strong&gt;, for instance, could have been arbitrary numeric values that do not correspond to any port in the data.&lt;/p&gt;
&lt;p&gt;We convert the two sets into matrix form (and also remove the column names).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trained&amp;lt;-as.matrix(train)
dimnames(trained)&amp;lt;-NULL

tested&amp;lt;-as.matrix(test)
dimnames(tested)&amp;lt;-NULL
str(tested)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  num [1:178, 1:8] 0 0 1 1 1 1 1 1 0 0 ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we pull out the target variable&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trainy&amp;lt;-trained[,1]
testy&amp;lt;-tested[,1]
trainx&amp;lt;-trained[,-1]
testx&amp;lt;-tested[,-1]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we apply one-hot encoding to the target variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trainlabel&amp;lt;-to_categorical(trainy)
testlabel&amp;lt;-to_categorical(testy)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;train-the-model.&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Train the model.&lt;/h2&gt;
&lt;p&gt;Now it is time to build our model. The first step is to define the model architecture and the number of layers that will be used, with their parameters.
We will choose a simple model with one hidden layer of 10 units (nodes). Since we have 7 predictors, the input_shape will be 7; the activation function of the hidden layer is &lt;strong&gt;relu&lt;/strong&gt;, the most widely used one, while for the output layer we choose the sigmoid function since we have a binary classification problem.&lt;/p&gt;
&lt;div id=&#34;create-the-model&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Create the model&lt;/h3&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model &amp;lt;- keras_model_sequential()

model %&amp;gt;%
    layer_dense(units=10,activation = &amp;quot;relu&amp;quot;,
              kernel_initializer = &amp;quot;he_normal&amp;quot;,input_shape =c(7))%&amp;gt;%
    layer_dense(units=2,activation = &amp;quot;sigmoid&amp;quot;)

summary(model)  &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Model: &amp;quot;sequential&amp;quot;
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense (Dense)                       (None, 10)                      80          
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 2)                       22          
## ================================================================================
## Total params: 102
## Trainable params: 102
## Non-trainable params: 0
## ________________________________________________________________________________&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We have 102 parameters to estimate in total. The hidden layer has 7 inputs feeding 10 nodes, plus 10 biases, giving 80 parameters (7*10+10). The 22 parameters of the output layer are obtained the same way (10*2+2).&lt;/p&gt;
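&lt;p&gt;The parameter counts can be checked with simple arithmetic, using the layer sizes of our model:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# hidden layer: 7 inputs times 10 units, plus 10 biases
hidden_params &amp;lt;- 7 * 10 + 10    # 80
# output layer: 10 hidden units times 2 output units, plus 2 biases
output_params &amp;lt;- 10 * 2 + 2     # 22
hidden_params + output_params    # 102, matching summary(model)&lt;/code&gt;&lt;/pre&gt;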
&lt;/div&gt;
&lt;div id=&#34;compile-the-model&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Compile the model&lt;/h3&gt;
&lt;p&gt;In the &lt;strong&gt;compile&lt;/strong&gt; function (from keras) we specify the loss function, the optimizer, and the metric that will be used. In our case we use &lt;strong&gt;binary crossentropy&lt;/strong&gt; as the loss, the popular &lt;strong&gt;adam&lt;/strong&gt; optimizer, and &lt;strong&gt;accuracy&lt;/strong&gt; as the metric.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model %&amp;gt;%
  compile(loss=&amp;quot;binary_crossentropy&amp;quot;,
          optimizer=&amp;quot;adam&amp;quot;,
          metric=&amp;quot;accuracy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;fit-the-model&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Fit the model&lt;/h3&gt;
&lt;p&gt;Now we can run our model and follow the dynamic evolution of the training process in the plot window in the lower right corner of the screen (you can also plot the history in a static way afterwards).
For our model we choose 100 epochs (iterations), the stochastic gradient uses batches of 20 samples at each iteration, and we hold out 20% of the training data to assess the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#history&amp;lt;- model %&amp;gt;%
# fit(trainx,trainlabel,epoch=100,batch_size=20,validation_split=0.2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: if you would like to rerun the model, uncomment the above code.&lt;/p&gt;
&lt;p&gt;We can extract the last five metric values from the history object as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#df &amp;lt;- tibble(train_loss=history$metrics$loss, valid_loss=history$metrics$val_loss,
#      train_acc=history$metrics$accuracy, valid_acc=history$metrics$val_accuracy)
#write_csv(df,&amp;quot;df.csv&amp;quot;)
df &amp;lt;- read.csv(&amp;quot;df.csv&amp;quot;)
tail(df,5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     train_loss valid_loss train_acc valid_acc
## 96   0.4600244  0.4038978 0.7850877 0.8146853
## 97   0.4655294  0.4080083 0.7850877 0.8181818
## 98   0.4616975  0.4048636 0.7894737 0.8286713
## 99   0.4634421  0.4092717 0.7929825 0.8216783
## 100  0.4639769  0.4116935 0.7789474 0.8216783&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It should be noted that since the accuracy curves stay close to each other and move in the same direction, we do not have to worry about overfitting. The opposite, however, is more pronounced: the accuracy on the training samples is lower than that on the validation samples (underfitting), so we should increase the complexity of the model (by adding more nodes or more layers).&lt;/p&gt;
&lt;p&gt;We can save this model (or only its weights) and load it again for further use.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#save_model_hdf5(model,&amp;quot;simplemodel.h5&amp;quot;)
model&amp;lt;-load_model_hdf5(&amp;quot;simplemodel.h5&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;the-model-evaluation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The model evaluation&lt;/h2&gt;
&lt;p&gt;Let’s evaluate our model using the training set and then the testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_eva &amp;lt;- model %&amp;gt;%
  evaluate(trainx,trainlabel)
test_eva &amp;lt;- model %&amp;gt;% 
  evaluate(testx, testlabel) 
tibble(train_acc= train_eva[[&amp;quot;accuracy&amp;quot;]], test_acc= test_eva[[&amp;quot;accuracy&amp;quot;]], train_loss=train_eva[[&amp;quot;loss&amp;quot;]],test_loss=test_eva[[&amp;quot;loss&amp;quot;]])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The accuracy rate of the model on the test set is 80.89%, which is higher than that on the training set (79.92%); this confirms that the model is underfitting and needs more improvement.&lt;/p&gt;
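&lt;p&gt;Since we loaded &lt;strong&gt;caret&lt;/strong&gt; earlier partly for its confusion matrix, the test-set predictions could also be inspected class by class. This is a sketch: &lt;code&gt;predict&lt;/code&gt; on a keras model returns a matrix of class probabilities (one column per class here, because of the one-hot encoded target), which we turn into 0/1 labels by taking the most probable column.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prob &amp;lt;- predict(model, testx)   # matrix of class probabilities
pred &amp;lt;- max.col(prob) - 1       # most probable class, recoded to 0/1
confusionMatrix(factor(pred), factor(testy))&lt;/code&gt;&lt;/pre&gt;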
&lt;/div&gt;
&lt;div id=&#34;model-tuning&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Model tuning&lt;/h2&gt;
&lt;p&gt;Let’s now include another hidden layer with 20 nodes, and also increase the number of epochs to 200. In addition, as we did with the previous model, we should save our optimal model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model1 &amp;lt;- keras_model_sequential()

model1 %&amp;gt;%
    layer_dense(units=10,activation = &amp;quot;relu&amp;quot;,
              kernel_initializer = &amp;quot;he_normal&amp;quot;,input_shape =c(7)) %&amp;gt;%
    layer_dense(units=20, activation = &amp;quot;relu&amp;quot;,
              kernel_initializer = &amp;quot;he_normal&amp;quot;) %&amp;gt;%
    layer_dense(units=2,activation = &amp;quot;sigmoid&amp;quot;)

model1 %&amp;gt;%
  compile(loss=&amp;quot;binary_crossentropy&amp;quot;,
          optimizer=&amp;quot;adam&amp;quot;,
          metric=&amp;quot;accuracy&amp;quot;)

#history1&amp;lt;- model1 %&amp;gt;%
#   fit (trainx,trainlabel,epoch=200,batch_size=40,validation_split=0.2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before evaluating, we save the model and load it back.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#save_model_hdf5(model,&amp;quot;simplemodel1.h5&amp;quot;)
model1&amp;lt;-load_model_hdf5(&amp;quot;simplemodel1.h5&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s evaluate this new model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_eva &amp;lt;- model1 %&amp;gt;%
  evaluate(trainx,trainlabel)
test_eva &amp;lt;- model1 %&amp;gt;% 
  evaluate(testx, testlabel)
tibble(train_acc= train_eva[[&amp;quot;accuracy&amp;quot;]], test_acc= test_eva[[&amp;quot;accuracy&amp;quot;]], train_loss=train_eva[[&amp;quot;loss&amp;quot;]],test_loss=test_eva[[&amp;quot;loss&amp;quot;]])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this new model we get a noticeable improvement in both accuracies. We could go back to the model and try increasing the nodes or the layers, or play around with other parameters, to get better results.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Practically, deep learning models are more efficient than most of the classical machine learning models when it comes to fitting complex and large data sets. Moreover, for some types of data, such as images or speech, deep learning shows its greatest capability.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Time series with ARIMA and RNN models</title>
      <link>https://modelingwithr.rbind.io/courses/rnn/time-series-with-recurrent-neaural-network-rnn-lstm-model/</link>
      <pubDate>Tue, 05 May 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/courses/rnn/time-series-with-recurrent-neaural-network-rnn-lstm-model/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#arima-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; ARIMA model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#rnn-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; RNN model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#reshape-the-time-series&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.0.1&lt;/span&gt; Reshape the time series&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-architecture&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.1&lt;/span&gt; Model architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-training&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.2&lt;/span&gt; Model training&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#prediction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.3&lt;/span&gt; Prediction&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#results-comparison&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Results comparison&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#further-reading&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Further reading&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-info&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8&lt;/span&gt; Session info&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;The classical methods for predicting univariate time series are &lt;a href=&#34;https://otexts.com/fpp2/arima.html&#34;&gt;ARIMA&lt;/a&gt; models (under the linearity assumption, and provided that the non-stationarity is of the difference-stationary (DS) type), which use the autocorrelations (up to some order) to predict the target variable as a linear function of its own past values (the autoregressive part) and the past values of the errors (the moving average part). However, the hardest step in ARIMA modeling is to derive a stationary series from a non-stationary one whose trend (deterministic or stochastic) or seasonality is not well defined. The RNN model, proposed by John Hopfield (1982), is a deep learning model that does not need the above requirements (linearity and a particular type of non-stationarity) and can capture and model the memory of the time series, which is the main characteristic of several types of sequence data besides time series, such as &lt;strong&gt;text data&lt;/strong&gt;, &lt;strong&gt;image captioning&lt;/strong&gt;, and &lt;strong&gt;speech recognition&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The basic idea behind the RNN is very simple (as described in the plot below). At each time step &lt;strong&gt;t&lt;/strong&gt; the model computes a state value &lt;span class=&#34;math inline&#34;&gt;\(h_t\)&lt;/span&gt; that combines, in a linear combination, the previous state &lt;span class=&#34;math inline&#34;&gt;\(h_{t-1}\)&lt;/span&gt; (which contains all the memory available at time &lt;strong&gt;t-1&lt;/strong&gt;) and the current input &lt;span class=&#34;math inline&#34;&gt;\(x_t\)&lt;/span&gt; (the current value of the time series), then passes the result through the activation function &lt;strong&gt;tanh&lt;/strong&gt; (to capture any nonlinear relations). The state at each time step t can thus formally be expressed as follows:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[h_t=tanh(W_h.h_{t-1}+W_x.x_t+b)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;We then leave it to gradient descent to decide how much memory to keep by computing the optimal weights &lt;span class=&#34;math inline&#34;&gt;\(W_h\)&lt;/span&gt;.
Similarly, the output &lt;span class=&#34;math inline&#34;&gt;\(y_t\)&lt;/span&gt; is computed as:
&lt;span class=&#34;math display&#34;&gt;\[y_t=W_y.h_t\]&lt;/span&gt;&lt;/p&gt;
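&lt;p&gt;A single pass of this recurrence can be sketched in plain R; the weight values below are arbitrary illustrative numbers, not fitted parameters:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# one RNN step: combine the previous state and the current input, then apply tanh
rnn_step &amp;lt;- function(h_prev, x_t, W_h = 0.5, W_x = 0.8, b = 0.1) {
  tanh(W_h * h_prev + W_x * x_t + b)
}
h &amp;lt;- 0                                # initial state
for (x_t in c(1.43, 1.44, 1.42)) h &amp;lt;- rnn_step(h, x_t)
y &amp;lt;- 0.9 * h                          # output: y_t = W_y * h_t&lt;/code&gt;&lt;/pre&gt;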
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;img &amp;lt;- EBImage::readImage(&amp;quot;C://Users/dell/Documents/new-blog/content/courses/rnn/rnn_plot.jpg&amp;quot;)
plot(img)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/rnn/2020-05-05-time-series-with-recurrent-neaural-network-rnn-lstm-model_files/figure-html/unnamed-chunk-2-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;First let’s call the packages needed for our analysis&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh &amp;lt;- suppressPackageStartupMessages
ssh(library(timeSeries))
ssh(library(tseries))
ssh(library(aTSA))
ssh(library(forecast))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;forecast&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh(library(rugarch))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;rugarch&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh(library(ModelMetrics))
ssh(library(keras))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this article we will use the data &lt;strong&gt;USDCHF&lt;/strong&gt; from the &lt;strong&gt;timeSeries&lt;/strong&gt; package, which is the univariate time series of intraday foreign exchange rates between the US dollar and the Swiss franc, with &lt;strong&gt;62496&lt;/strong&gt; observations.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(USDCHF)
length(USDCHF)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 62496&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s look at this data with the following plot, after converting it to a ts object.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(USDCHF)
data &amp;lt;- ts(USDCHF, frequency = 365)
plot(data)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/rnn/2020-05-05-time-series-with-recurrent-neaural-network-rnn-lstm-model_files/figure-html/unnamed-chunk-5-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This series seems to have a trend and is not stationary, but let’s verify this with the &lt;a href=&#34;https://faculty.washington.edu/ezivot/econ584/notes/unitroot.pdf&#34;&gt;Dickey-Fuller&lt;/a&gt; and &lt;a href=&#34;https://faculty.washington.edu/ezivot/econ584/notes/unitroot.pdf&#34;&gt;Phillips-Perron&lt;/a&gt; tests.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;adf.test(data)
pp.test(data)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both tests confirm that the data has a unit root (high p-value: we do not reject the null hypothesis). We can also check the correlogram of the autocorrelation function
&lt;a href=&#34;https://towardsdatascience.com/significance-of-acf-and-pacf-plots-in-time-series-analysis-2fa11a5d10a8&#34;&gt;acf&lt;/a&gt; and the partial autocorrelation function &lt;a href=&#34;https://towardsdatascience.com/significance-of-acf-and-pacf-plots-in-time-series-analysis-2fa11a5d10a8&#34;&gt;pacf&lt;/a&gt; as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;acf(data)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/rnn/2020-05-05-time-series-with-recurrent-neaural-network-rnn-lstm-model_files/figure-html/unnamed-chunk-7-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pacf(data)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/rnn/2020-05-05-time-series-with-recurrent-neaural-network-rnn-lstm-model_files/figure-html/unnamed-chunk-7-2.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As you know, the ACF relates to the MA part and the PACF to the AR part. Since the PACF shows one bar that far exceeds the confidence interval, we can be confident that our data has a unit root, and we can get rid of it by differencing the data once. In ARIMA terms the data should be integrated of order 1 (d=1); this is the &lt;strong&gt;I&lt;/strong&gt; part of ARIMA. In addition, since the PACF bars do not decay gradually, the model would not include any lags in the AR part.
In contrast, in the ACF plot all the bars lie far outside the confidence interval, so the model would include many MA lags.&lt;/p&gt;
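Differencing once simply replaces each value with its change from the previous one (in R this is what `diff()` does, as used later in this article). As a toy illustration, sketched here in Python, a linearly trending series becomes constant after one difference:

```python
def diff1(series):
    """First difference: y_t - y_{t-1}; removes a unit root / linear trend."""
    return [series[i] - series[i - 1] for i in range(1, len(series))]

trend = [1.0, 1.2, 1.4, 1.6, 1.8]   # non-stationary: steady upward trend
deltas = diff1(trend)                # constant changes: stationary
```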
&lt;/div&gt;
&lt;div id=&#34;arima-model&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; ARIMA model&lt;/h1&gt;
&lt;p&gt;To fit an ARIMA model we have to determine the lag of the AR (p) and MA (q) components and how many times to difference the series to make it stationary (d). Fortunately, we do not have to worry about these issues; we leave everything to the &lt;strong&gt;forecast&lt;/strong&gt; package, which provides a fast way to get the best model via the function &lt;strong&gt;auto.arima&lt;/strong&gt;. But before that, let’s hold out the last 100 observations as testing data in order to compare the quality of this model with that of the RNN model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data_test &amp;lt;- data[(length(data)-99):length(data)]
data_train &amp;lt;- data[1:(length(data)-99-1)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_arima &amp;lt;- auto.arima(data_train)
summary(model_arima)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Series: data_train 
## ARIMA(0,1,2) with drift 
## 
## Coefficients:
##           ma1     ma2  drift
##       -0.0193  0.0113      0
## s.e.   0.0040  0.0040      0
## 
## sigma^2 estimated as 2.29e-06:  log likelihood=316634.5
## AIC=-633260.9   AICc=-633260.9   BIC=-633224.8
## 
## Training set error measures:
##                        ME        RMSE          MAE           MPE       MAPE
## Training set 1.900607e-08 0.001513064 0.0009922846 -3.671242e-05 0.06627114
##                  MASE          ACF1
## Training set 0.999585 -3.921999e-05&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As expected, this model is an ARIMA(0,1,2): integrated of order 1 (the differenced series is stationary), with two MA lags and a &lt;strong&gt;drift&lt;/strong&gt; (constant) term estimated at essentially zero. The output also reports some metric values, such as the root mean square error &lt;strong&gt;RMSE&lt;/strong&gt; and the mean absolute error &lt;strong&gt;MAE&lt;/strong&gt;, which are the most popular ones. We will use these metrics later to compare this model with the RNN model.
To validate this model we have to make sure that the residuals are white noise, without problems such as autocorrelation or &lt;a href=&#34;https://www.investopedia.com/terms/h/heteroskedasticity.asp&#34;&gt;heteroskedasticity&lt;/a&gt;. Thanks to the &lt;strong&gt;forecast&lt;/strong&gt; package we can check the residuals straightforwardly by calling the function &lt;strong&gt;checkresiduals&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;checkresiduals(model_arima)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/rnn/2020-05-05-time-series-with-recurrent-neaural-network-rnn-lstm-model_files/figure-html/unnamed-chunk-10-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2) with drift
## Q* = 8.6631, df = 7, p-value = 0.2778
## 
## Model df: 3.   Total lags used: 10&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the p-value is far larger than the 5% significance level, we do not reject the null hypothesis that the errors are not autocorrelated. Looking at the ACF plot, some bars do go outside the confidence interval, but this is to be expected at the 5% significance level (as false positives). So we can confirm the absence of correlation with 95% confidence.
For possible heteroskedasticity we use the &lt;a href=&#34;https://hal.archives-ouvertes.fr/hal-00588680/document&#34;&gt;ARCH-LM&lt;/a&gt; statistic from the &lt;strong&gt;aTSA&lt;/strong&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;arch.test(arima(data_train, order = c(0,1,2)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/rnn/2020-05-05-time-series-with-recurrent-neaural-network-rnn-lstm-model_files/figure-html/unnamed-chunk-11-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We see that both tests are highly significant (we reject the null hypothesis of homoskedasticity), so the above ARIMA model is not able to capture this pattern. That is why we should pair the above model with another model that keeps track of this type of pattern, called a &lt;a href=&#34;https://medium.com/auquan/time-series-analysis-for-finance-arch-garch-models-822f87f1d755&#34;&gt;GARCH&lt;/a&gt; model.
The GARCH model attempts to model the residuals of the ARIMA model with the following general formula:
&lt;span class=&#34;math display&#34;&gt;\[\epsilon_t=w_t\sqrt{h_t}\]&lt;/span&gt;
&lt;span class=&#34;math display&#34;&gt;\[h_t=a_0+\sum_{i=1}^{p}a_i\epsilon_{t-i}^2+\sum_{j=1}^{q}b_j h_{t-j}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Where &lt;span class=&#34;math inline&#34;&gt;\(w_t\)&lt;/span&gt; is white noise error.&lt;/p&gt;
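A single variance update under this model can be sketched as follows (Python; the coefficients a0, a1 and b1 are made-up values for illustration only — in practice they are estimated from the data, as done below):

```python
def garch11_update(h_prev, eps_prev, a0, a1, b1):
    """GARCH(1,1) conditional variance: h_t = a0 + a1*eps_{t-1}^2 + b1*h_{t-1}."""
    return a0 + a1 * eps_prev ** 2 + b1 * h_prev

# Illustrative parameters; a1 + b1 below 1 keeps the variance process stable.
a0, a1, b1 = 0.0001, 0.1, 0.85

h = a0 / (1 - a1 - b1)           # start from the unconditional variance
for eps in [0.01, -0.03, 0.02]:  # a toy sequence of past residuals
    h = garch11_update(h, eps, a0, a1, b1)
```

A large residual at time t-1 inflates the conditional variance at time t, which is how the model captures volatility clustering.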
&lt;p&gt;So we fit this model for different lags by calling the function &lt;strong&gt;garch&lt;/strong&gt; from the package &lt;strong&gt;tseries&lt;/strong&gt;, and we use the &lt;strong&gt;AIC&lt;/strong&gt; criterion to get the best model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model &amp;lt;- character()
AIC &amp;lt;- numeric()
for (p in 1:5){
  for(q in 1:5){
    model_g &amp;lt;- tseries::garch(model_arima$residuals, order = c(p,q), trace=F)
    model&amp;lt;-c(model,paste(&amp;quot;mod_&amp;quot;, p, q))
    AIC &amp;lt;- c(AIC, AIC(model_g))
    def &amp;lt;- tibble::tibble(model,AIC)
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning in tseries::garch(model_arima$residuals, order = c(p, q), trace = F):
## singular information
## (the same warning is repeated for each of the 25 fitted models)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;def %&amp;gt;% dplyr::arrange(AIC)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 25 x 2
##    model         AIC
##    &amp;lt;chr&amp;gt;       &amp;lt;dbl&amp;gt;
##  1 mod_ 1 1 -647018.
##  2 mod_ 2 1 -647005.
##  3 mod_ 1 2 -647005.
##  4 mod_ 2 3 -646986.
##  5 mod_ 1 3 -646971.
##  6 mod_ 1 4 -646967.
##  7 mod_ 2 2 -646900.
##  8 mod_ 3 3 -646885.
##  9 mod_ 3 1 -646859.
## 10 mod_ 1 5 -646859.
## # ... with 15 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, the simplest model, with one lag for each component, fits the residuals well.
We can check the residuals of this model with a Box test.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_garch &amp;lt;- tseries::garch(model_arima$residuals, order = c(1,1), trace=F)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning in tseries::garch(model_arima$residuals, order = c(1, 1), trace = F):
## singular information&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Box.test(model_garch$residuals)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##  Box-Pierce test
## 
## data:  model_garch$residuals
## X-squared = 3.1269, df = 1, p-value = 0.07701&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At the 5% significance level we do not reject the null hypothesis of independence.
As an alternative, we can inspect the ACF of the residuals.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;acf(model_garch$residuals[-1])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/rnn/2020-05-05-time-series-with-recurrent-neaural-network-rnn-lstm-model_files/figure-html/unnamed-chunk-14-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The easiest way to get prediction from our model is by making use of the &lt;strong&gt;rugarch&lt;/strong&gt; package. First, we specify the model with the parameters obtained above (the different lags)&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# garch1 &amp;lt;- ugarchspec(mean.model = list(armaOrder = c(0,2), include.mean = FALSE), 
# variance.model = list(garchOrder = c(1,1))) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we use the function &lt;strong&gt;ugarchfit&lt;/strong&gt; to fit our data_train. However, you might have noticed that we supplied only the lags of the AR and MA parts of our ARIMA model (the d value for integration is not available in this function), so we should provide the differenced series of &lt;strong&gt;data_train&lt;/strong&gt; instead of the original series.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Ddata_train &amp;lt;- diff(data_train)
# garchfit &amp;lt;- ugarchfit(data=Ddata_train, spec = garch1, solver = &amp;quot;gosolnp&amp;quot;,trace=F)
# coef(garchfit)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our final model will be written as follows.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[y_t=e_t-4.296\times 10^{-2}e_{t-1}+5.687\times 10^{-3}e_{t-2} \\
e_t\sim N(0,\hat\sigma_t^2) \\
\hat\sigma_t^2=1.950\times 10^{-7}+2.565\times 10^{-1}e_{t-1}^2+6.940\times 10^{-1}\hat\sigma_{t-1}^2\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;: each run of the above model gives different results due to the internal randomization process; that is why I commented out the above code, to prevent it from being rerun when rendering this document.&lt;/p&gt;
&lt;p&gt;Now we use this model to forecast 100 future values, which will then be compared with the data_test values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# fitted &amp;lt;- ugarchforecast(garchfit, n.ahead = 100)
# yh_test &amp;lt;- numeric(100)
# # anchor the first forecast level at the last training observation
# yh_test[1] &amp;lt;- data_train[length(data_train)] + fitted(fitted)[1]
# for (i in 2:100){
#   yh_test[i] &amp;lt;- yh_test[i-1] + fitted(fitted)[i]
# }
# df_eval &amp;lt;- tibble::tibble(y_test = data_test, yh_test = yh_test)
# df_eval&lt;/code&gt;&lt;/pre&gt;
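The logic of the commented loop — each forecast level is the previous level plus the corresponding differenced forecast — amounts to a cumulative sum anchored at the last training observation. A toy sketch (Python, with made-up numbers):

```python
def undiff(last_level, diffs):
    """Rebuild levels from differenced forecasts:
    level[0] = last_level + diffs[0]; level[i] = level[i-1] + diffs[i]."""
    levels = []
    level = last_level
    for d in diffs:
        level = level + d
        levels.append(level)
    return levels

# Toy numbers: last observed rate 1.5, three differenced forecasts.
forecast_levels = undiff(1.5, [0.001, -0.002, 0.0005])
```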
&lt;p&gt;Finally we should save the &lt;strong&gt;df_eval&lt;/strong&gt; table with the original and the fitted values of the data_test for further use.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#write.csv(df_eval, &amp;quot;df_eval.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;rnn-model&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; RNN model&lt;/h1&gt;
&lt;p&gt;As an alternative to the ARIMA prediction method discussed above, the deep learning RNN method can also take into account the memory of the time series. Unlike classical feedforward networks, which process each single input independently, the RNN takes a bunch of inputs that are supposed to form one sequence and processes them together, as shown in the first plot. In keras this step is provided by &lt;strong&gt;layer_simple_rnn&lt;/strong&gt; (Chollet, 2017, p. 167).
This means we have to decide the length of the sequence, in other words how far back we think the current value depends on (the memory of the time series). In our case we assume that the last 7 values should be satisfactory to predict the current value.&lt;/p&gt;
&lt;div id=&#34;reshape-the-time-series&#34; class=&#34;section level3&#34; number=&#34;4.0.1&#34;&gt;
&lt;h3&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.0.1&lt;/span&gt; Reshape the time series&lt;/h3&gt;
&lt;p&gt;The first thing we do is organize the data in such a way that the model knows which part is considered as sequences to be processed by the RNN layer, and which part is the target variable. To do so we reorganize the time series into a matrix where each row is a single input, the columns contain the lagged values (of the target variable) up to 7, and the target variable sits in the last column. Consequently, the total number of rows will be &lt;strong&gt;length(data)-maxlen-1&lt;/strong&gt;, where maxlen refers to the (constant) length of each sequence, here equal to 7.&lt;/p&gt;
&lt;p&gt;Let’s first create an empty matrix&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;maxlen &amp;lt;- 7
exch_matrix&amp;lt;- matrix(0, nrow = length(data_train)-maxlen-1, ncol = maxlen+1) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s move our time series into this matrix and display some rows to make sure the output is as expected.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;for(i in 1:(length(data_train)-maxlen-1)){
  exch_matrix[i,] &amp;lt;- data_train[i:(i+maxlen)]
}
head(exch_matrix)  &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]
## [1,] 1.1930 1.1941 1.1933 1.1931 1.1924 1.1926 1.1926 1.1932
## [2,] 1.1941 1.1933 1.1931 1.1924 1.1926 1.1926 1.1932 1.1933
## [3,] 1.1933 1.1931 1.1924 1.1926 1.1926 1.1932 1.1933 1.1932
## [4,] 1.1931 1.1924 1.1926 1.1926 1.1932 1.1933 1.1932 1.1933
## [5,] 1.1924 1.1926 1.1926 1.1932 1.1933 1.1932 1.1933 1.1934
## [6,] 1.1926 1.1926 1.1932 1.1933 1.1932 1.1933 1.1934 1.1940&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we separate the inputs from the target.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;x_train &amp;lt;- exch_matrix[, -ncol(exch_matrix)]
y_train &amp;lt;- exch_matrix[, ncol(exch_matrix)]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The RNN layer in keras expects the inputs to have the shape (examples, maxlen, number of features). Since we have only one feature (our single time series, processed sequentially), the shape of the inputs should be c(examples, 7, 1). However, the first dimension can be discarded and we can provide only the last two.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dim(x_train)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 62388     7&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see this shape does not include the number of features, so we can correct it as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;x_train &amp;lt;- array_reshape(x_train, dim = c((length(data_train)-maxlen-1), maxlen, 1))
dim(x_train)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 62388     7     1&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;model-architecture&#34; class=&#34;section level2&#34; number=&#34;4.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.1&lt;/span&gt; Model architecture&lt;/h2&gt;
&lt;p&gt;When it comes to deep learning models there is a large space of hyperparameters to be defined, and the results depend heavily on them: the optimal number of layers, the optimal number of nodes in each layer, the suitable activation function, the suitable loss function, the best optimizer, the best regularization techniques, the best random initialization, etc. Unfortunately, we do not yet have an exact rule for deciding these hyperparameters; they depend on the problem under study, the data at hand, and the experience of the modeler. In our case, for instance, the data is very simple and does not really require a complex architecture, so we will use only one hidden RNN layer with 10 nodes; the loss function will be the mean square error &lt;strong&gt;mse&lt;/strong&gt;, the optimizer will be &lt;strong&gt;adam&lt;/strong&gt;, and the metric will be the mean absolute error &lt;strong&gt;mae&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: with large and complex time series it might be necessary to stack several RNN layers.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model &amp;lt;- keras_model_sequential()
model %&amp;gt;% 
  layer_dense(input_shape = dim(x_train)[-1], units=maxlen) %&amp;gt;% 
  layer_simple_rnn(units=10) %&amp;gt;% 
  layer_dense(units = 1)
summary(model)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Model: &amp;quot;sequential&amp;quot;
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense (Dense)                       (None, 7, 7)                    14          
## ________________________________________________________________________________
## simple_rnn (SimpleRNN)              (None, 10)                      180         
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 1)                       11          
## ================================================================================
## Total params: 205
## Trainable params: 205
## Non-trainable params: 0
## ________________________________________________________________________________&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;model-training&#34; class=&#34;section level2&#34; number=&#34;4.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.2&lt;/span&gt; Model training&lt;/h2&gt;
&lt;p&gt;Now let’s compile and run the model with 5 epochs and a batch_size of 32 instances at a time to update the weights; to keep track of the model performance we hold out 10% of the training data as a validation set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model %&amp;gt;% compile(
  loss = &amp;quot;mse&amp;quot;,
  optimizer= &amp;quot;adam&amp;quot;,
  metric = &amp;quot;mae&amp;quot; 
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#history &amp;lt;- model %&amp;gt;% 
#  fit(x_train, y_train, epochs = 5, batch_size = 32, validation_split=0.1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since each rerun of the model gives different results, we should save the model (or only the model weights) and reload it again; this way, when rendering the document we will not be surprised by different outputs.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#save_model_hdf5(model, &amp;quot;rnn_model.h5&amp;quot;)
rnn_model &amp;lt;- load_model_hdf5(&amp;quot;rnn_model.h5&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;prediction&#34; class=&#34;section level2&#34; number=&#34;4.3&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.3&lt;/span&gt; Prediction&lt;/h2&gt;
&lt;p&gt;In order to get the prediction of the last 100 data points, we will predict the entire data and then compute the &lt;strong&gt;rmse&lt;/strong&gt; for the last 100 predictions.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;maxlen &amp;lt;- 7
exch_matrix2&amp;lt;- matrix(0, nrow = length(data)-maxlen-1, ncol = maxlen+1) 

for(i in 1:(length(data)-maxlen-1)){
  exch_matrix2[i,] &amp;lt;- data[i:(i+maxlen)]
}

x_train2 &amp;lt;- exch_matrix2[, -ncol(exch_matrix2)]
y_train2 &amp;lt;- exch_matrix2[, ncol(exch_matrix2)]

x_train2 &amp;lt;- array_reshape(x_train2, dim = c((length(data)-maxlen-1), maxlen, 1))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- rnn_model %&amp;gt;% predict(x_train2)
df_eval_rnn &amp;lt;- tibble::tibble(y_rnn=y_train2[(length(y_train2)-99):length(y_train2)],
                          yhat_rnn=as.vector(pred)[(length(y_train2)-99):length(y_train2)])&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;results-comparison&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Results comparison&lt;/h1&gt;
&lt;p&gt;We can now compare the prediction of the last 100 data points from this model with the predicted values for the same data points from the ARIMA model. We first load the data predicted with the ARIMA model above and join everything in one data frame; then we compare using two metrics, &lt;strong&gt;rmse&lt;/strong&gt; and &lt;strong&gt;mae&lt;/strong&gt;, which are readily available in the &lt;strong&gt;ModelMetrics&lt;/strong&gt; package.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: You might ask why we only use 100 data points for prediction when, in machine learning, we usually use a larger number, sometimes 20% of the entire data. The answer lies in the nature of ARIMA models, which are short-term prediction models, especially with financial data characterized by high and unstable volatility (which is why we used the GARCH model above).&lt;/p&gt;
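For reference, the two metrics follow standard definitions; a quick Python sketch of what they compute:

```python
import math

def rmse(actual, predicted):
    """Root mean square error: penalizes large errors quadratically."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mae(actual, predicted):
    """Mean absolute error: the average error magnitude."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n
```

Because rmse squares the errors before averaging, a few large misses weigh more heavily there than in mae, which is why the two metrics can rank the models differently.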
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_eval &amp;lt;- read.csv(&amp;quot;df_eval.csv&amp;quot;)
rmse &amp;lt;- c(rmse(df_eval$y_test, df_eval$yh_test), 
          rmse(df_eval_rnn$y_rnn, df_eval_rnn$yhat_rnn) )
mae &amp;lt;- c(mae(df_eval$y_test, df_eval$yh_test), 
          mae(df_eval_rnn$y_rnn, df_eval_rnn$yhat_rnn) )
df &amp;lt;- tibble::tibble(model=c(&amp;quot;ARIMA&amp;quot;, &amp;quot;RNN&amp;quot;), rmse, mae)
df&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 3
##   model    rmse     mae
##   &amp;lt;chr&amp;gt;   &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
## 1 ARIMA 0.00563 0.00388
## 2 RNN   0.00442 0.00401&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, the two models are close to each other. With &lt;strong&gt;rmse&lt;/strong&gt;, the most popular metric for continuous variables, the RNN model is better, while with &lt;strong&gt;mae&lt;/strong&gt; they are approximately the same.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Conclusion&lt;/h1&gt;
&lt;p&gt;Even though this data is very simple, does not need an RNN model, and can be predicted with classical ARIMA models, it is used here for pedagogical purposes: to understand well how the RNN works and how the data should be processed to be ingested by &lt;strong&gt;keras&lt;/strong&gt;. However, the simple RNN model suffers from a major problem when run over long sequences, known as the &lt;strong&gt;vanishing gradient&lt;/strong&gt; and &lt;strong&gt;exploding gradient&lt;/strong&gt; problem. With the former, when using the chain rule to compute the gradients, if the derivatives have small values then multiplying a large number of small values (as many as the length of the sequence) yields very tiny values that make the network slow to train or even untrainable. The opposite happens with the latter: we get very large values and the network never converges.&lt;br /&gt;
Soon I will post an article on multivariate time series implementing the Long Short-Term Memory &lt;strong&gt;LSTM&lt;/strong&gt; model, which is designed to overcome the above problems faced by the simple RNN model.&lt;/p&gt;
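The vanishing/exploding effect is easy to see numerically: multiplying one per-step derivative factor across a long sequence either collapses toward zero or blows up. A toy Python illustration (not an actual backpropagation, just the repeated product the chain rule induces):

```python
def chained_gradient(factor, steps):
    """Product of identical per-step derivative factors across a sequence,
    mimicking the repeated multiplication in the chain rule."""
    grad = 1.0
    for _ in range(steps):
        grad = grad * factor
    return grad

vanishing = chained_gradient(0.5, 50)   # shrinks toward zero
exploding = chained_gradient(1.5, 50)   # grows without bound
```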
&lt;/div&gt;
&lt;div id=&#34;further-reading&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Further reading&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;François Chollet, Deep Learning with R, MEAP edition, 2017, p. 167&lt;/li&gt;
&lt;li&gt;Ian Goodfellow et al., Deep Learning, &lt;a href=&#34;http://www.deeplearningbook.org/&#34; class=&#34;uri&#34;&gt;http://www.deeplearningbook.org/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;session-info&#34; class=&#34;section level1&#34; number=&#34;8&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;8&lt;/span&gt; Session info&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## R version 4.0.1 (2020-06-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] keras_2.3.0.0        ModelMetrics_1.2.2.2 rugarch_1.4-4       
## [4] forecast_8.13        aTSA_3.1.2           tseries_0.10-47     
## [7] timeSeries_3062.100  timeDate_3043.102   
## 
## loaded via a namespace (and not attached):
##  [1] jsonlite_1.7.1              assertthat_0.2.1           
##  [3] TTR_0.24.2                  tiff_0.1-5                 
##  [5] yaml_2.2.1                  GeneralizedHyperbolic_0.8-4
##  [7] numDeriv_2016.8-1.1         pillar_1.4.6               
##  [9] lattice_0.20-41             reticulate_1.16            
## [11] glue_1.4.2                  quadprog_1.5-8             
## [13] DistributionUtils_0.6-0     digest_0.6.25              
## [15] colorspace_1.4-1            htmltools_0.5.0            
## [17] Matrix_1.2-18               pkgconfig_2.0.3            
## [19] bookdown_0.20               purrr_0.3.4                
## [21] fftwtools_0.9-9             mvtnorm_1.1-1              
## [23] scales_1.1.1                whisker_0.4                
## [25] jpeg_0.1-8.1                tibble_3.0.3               
## [27] farver_2.0.3                EBImage_4.30.0             
## [29] generics_0.0.2              ggplot2_3.3.2              
## [31] ellipsis_0.3.1              urca_1.3-0                 
## [33] nnet_7.3-14                 BiocGenerics_0.34.0        
## [35] cli_2.0.2                   quantmod_0.4.17            
## [37] magrittr_1.5                crayon_1.3.4               
## [39] mclust_5.4.6                evaluate_0.14              
## [41] ks_1.11.7                   fansi_0.4.1                
## [43] nlme_3.1-149                MASS_7.3-53                
## [45] xts_0.12.1                  truncnorm_1.0-8            
## [47] blogdown_0.20               tools_4.0.1                
## [49] data.table_1.13.0           lifecycle_0.2.0            
## [51] stringr_1.4.0               munsell_0.5.0              
## [53] locfit_1.5-9.4              compiler_4.0.1             
## [55] SkewHyperbolic_0.4-0        rlang_0.4.7                
## [57] grid_4.0.1                  RCurl_1.98-1.2             
## [59] nloptr_1.2.2.2              rappdirs_0.3.1             
## [61] htmlwidgets_1.5.2           Rsolnp_1.16                
## [63] labeling_0.3                base64enc_0.1-3            
## [65] spd_2.0-1                   bitops_1.0-6               
## [67] rmarkdown_2.4               gtable_0.3.0               
## [69] fracdiff_1.5-1              abind_1.4-5                
## [71] curl_4.3                    R6_2.4.1                   
## [73] tfruns_1.4                  zoo_1.8-8                  
## [75] tensorflow_2.2.0            knitr_1.30                 
## [77] dplyr_1.0.2                 utf8_1.1.4                 
## [79] zeallot_0.1.0               KernSmooth_2.23-17         
## [81] stringi_1.5.3               Rcpp_1.0.5                 
## [83] vctrs_0.3.4                 png_0.1-7                  
## [85] tidyselect_1.1.0            xfun_0.18                  
## [87] lmtest_0.9-38&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Locality Sensitive Hashing Model</title>
      <link>https://modelingwithr.rbind.io/sparklyr/lsh_spark/local-snsitivity-hashing-model/</link>
      <pubDate>Tue, 28 Apr 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/sparklyr/lsh_spark/local-snsitivity-hashing-model/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data Preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#prediction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Prediction&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#similarity-based-on-distance&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.1&lt;/span&gt; Similarity based on distance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#similarity-based-on-the-number-of-nearest-neighbours&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.2&lt;/span&gt; Similarity based on the number of nearest neighbours&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#further-reading&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Further reading&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-information&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Session information&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;This model is an approximate version of the knn model, which is difficult to implement with large data sets. In contrast to the knn model, which looks for the exact nearest neighbours, this model looks for neighbours with high probability. Spark provides two methods to find approximate neighbours, depending on the data type at hand: &lt;strong&gt;Bucketed random projection&lt;/strong&gt; and &lt;strong&gt;MinHash for Jaccard distance&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The first method projects the data into lower-dimensional hashes, in which similar hash values indicate that the associated points (or observations) are close to each other. The mathematical basis of this technique is the following formula:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[h^{\vec x,b}(\vec\upsilon)=\left\lfloor \frac{\vec\upsilon\cdot\vec x+b}{w}\right\rfloor\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Where &lt;span class=&#34;math inline&#34;&gt;\(h\)&lt;/span&gt; is the hashing function, &lt;span class=&#34;math inline&#34;&gt;\(\vec\upsilon\)&lt;/span&gt; is the feature vector, &lt;span class=&#34;math inline&#34;&gt;\(\vec x\)&lt;/span&gt; is a standard normal vector of the same length, &lt;span class=&#34;math inline&#34;&gt;\(b\)&lt;/span&gt; is a random offset, and &lt;span class=&#34;math inline&#34;&gt;\(w\)&lt;/span&gt; is the bin width of the hashing bins; the symbol &lt;span class=&#34;math inline&#34;&gt;\(\lfloor \rfloor\)&lt;/span&gt; coerces the result to an integer value. The idea is simple: we take the dot product of each feature vector with a random vector, and the resulting projections (which are random) are grouped into buckets; these buckets are supposed to contain similar points. This process can be repeated many times, with a different random vector each time, to refine the similarity. For more detail about this technique click &lt;a href=&#34;https://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
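&lt;p&gt;As a rough illustration of this idea, here is a minimal sketch in plain R (independent of Spark, with made-up vectors): two nearby points tend to fall into the same bucket.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# minimal sketch of the random-projection hash, for intuition only
set.seed(1)
v1 &amp;lt;- c(1, 2, 3)      # two nearby feature vectors
v2 &amp;lt;- c(1.1, 2.1, 2.9)
x &amp;lt;- rnorm(3)          # random standard normal vector
w &amp;lt;- 2                 # bin width
h &amp;lt;- function(v) floor(sum(v * x)/w)
c(h(v1), h(v2))         # nearby points usually share a bucket&lt;/code&gt;&lt;/pre&gt;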
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data Preparation&lt;/h1&gt;
&lt;p&gt;For those who do not know much about sparklyr, check my article &lt;a href=&#34;https://modelingwithr.rbind.io/sparklyr/sparklyr/&#34;&gt;introduction to sparklyr&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;First let’s load the sparklyr and tidyverse packages, then set up the connection to Spark and read in the Titanic data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(sparklyr, warn.conflicts = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;sparklyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse, warn.conflicts = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;ggplot2&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;tibble&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;tidyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;dplyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sc &amp;lt;- spark_connect(master = &amp;quot;local&amp;quot;)
mydata &amp;lt;- spark_read_csv(sc, &amp;quot;titanic&amp;quot;, path = &amp;quot;C://Users/dell/Documents/new-blog/content/post/train.csv&amp;quot;)
glimpse(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Rows: ??
Columns: 12
Database: spark_connection
$ PassengerId &amp;lt;int&amp;gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
$ Survived    &amp;lt;int&amp;gt; 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
$ Pclass      &amp;lt;int&amp;gt; 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
$ Name        &amp;lt;chr&amp;gt; &amp;quot;Braund, Mr. Owen Harris&amp;quot;, &amp;quot;Cumings, Mrs. John Bradley ...
$ Sex         &amp;lt;chr&amp;gt; &amp;quot;male&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;male&amp;quot;, &amp;quot;male&amp;quot;, &amp;quot;...
$ Age         &amp;lt;dbl&amp;gt; 22, 38, 26, 35, 35, NaN, 54, 2, 27, 14, 4, 58, 20, 39, ...
$ SibSp       &amp;lt;int&amp;gt; 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
$ Parch       &amp;lt;int&amp;gt; 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
$ Ticket      &amp;lt;chr&amp;gt; &amp;quot;A/5 21171&amp;quot;, &amp;quot;PC 17599&amp;quot;, &amp;quot;STON/O2. 3101282&amp;quot;, &amp;quot;113803&amp;quot;, ...
$ Fare        &amp;lt;dbl&amp;gt; 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
$ Cabin       &amp;lt;chr&amp;gt; NA, &amp;quot;C85&amp;quot;, NA, &amp;quot;C123&amp;quot;, NA, NA, &amp;quot;E46&amp;quot;, NA, NA, NA, &amp;quot;G6&amp;quot;,...
$ Embarked    &amp;lt;chr&amp;gt; &amp;quot;S&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;Q&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;S&amp;quot;, ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you may notice, this data set is not large, but we intentionally chose it for its familiarity and simplicity, which make understanding the implementation of this model much easier. In other words, to implement this model with a very large data set we would repeat the same general steps.&lt;/p&gt;
&lt;p&gt;Then we remove some variables that we think are not very relevant for our purpose, keeping the &lt;strong&gt;PassengerId&lt;/strong&gt; variable because we need it later (though we give it a shorter name).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;newdata &amp;lt;- mydata %&amp;gt;% select(c(1, 2, 3, 5, 6, 7, 8, 10, 12)) %&amp;gt;% rename(id = PassengerId) %&amp;gt;% 
    glimpse()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Rows: ??
Columns: 9
Database: spark_connection
$ id       &amp;lt;int&amp;gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,...
$ Survived &amp;lt;int&amp;gt; 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1...
$ Pclass   &amp;lt;int&amp;gt; 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3...
$ Sex      &amp;lt;chr&amp;gt; &amp;quot;male&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;male&amp;quot;, &amp;quot;male&amp;quot;, &amp;quot;mal...
$ Age      &amp;lt;dbl&amp;gt; 22, 38, 26, 35, 35, NaN, 54, 2, 27, 14, 4, 58, 20, 39, 14,...
$ SibSp    &amp;lt;int&amp;gt; 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0...
$ Parch    &amp;lt;int&amp;gt; 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0...
$ Fare     &amp;lt;dbl&amp;gt; 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,...
$ Embarked &amp;lt;chr&amp;gt; &amp;quot;S&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;Q&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Perhaps the first thing to do in exploratory analysis is to check for missing values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;newdata %&amp;gt;% mutate_all(is.na) %&amp;gt;% mutate_all(as.numeric) %&amp;gt;% summarise_all(sum)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Source: spark&amp;lt;?&amp;gt; [?? x 9]
     id Survived Pclass   Sex   Age SibSp Parch  Fare Embarked
  &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
1     0        0      0     0   177     0     0     0        2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since we have a large number of missing values, it is better to impute them rather than remove them. For the numeric variable &lt;strong&gt;Age&lt;/strong&gt; we replace them with the median using the sparklyr function &lt;strong&gt;ft_imputer&lt;/strong&gt;, and for the categorical variable &lt;strong&gt;Embarked&lt;/strong&gt; we use the most frequent level, which here is the &lt;strong&gt;S&lt;/strong&gt; port. But first we should split the data into training and testing sets, to make sure that the testing set is completely isolated from the training set, and then impute each set separately.&lt;/p&gt;
&lt;p&gt;Since the data are a little bit imbalanced, we randomly split the data separately with respect to the target variable &lt;strong&gt;Survived&lt;/strong&gt; in order to preserve the same proportions of the Survived variable’s levels as in the original data; then we rebind the corresponding sets.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data_surv &amp;lt;- newdata %&amp;gt;% filter(Survived == 1)
data_not &amp;lt;- newdata %&amp;gt;% filter(Survived == 0)
partition_surv &amp;lt;- data_surv %&amp;gt;% sdf_random_split(training = 0.8, test = 0.2, seed = 123)
partition_not &amp;lt;- data_not %&amp;gt;% sdf_random_split(training = 0.8, test = 0.2, seed = 123)
train &amp;lt;- sdf_bind_rows(partition_surv$training, partition_not$training) %&amp;gt;% ft_imputer(input_cols = &amp;quot;Age&amp;quot;, 
    output_cols = &amp;quot;Age&amp;quot;, strategy = &amp;quot;median&amp;quot;) %&amp;gt;% na.replace(Embarked = &amp;quot;S&amp;quot;) %&amp;gt;% 
    compute(&amp;quot;train&amp;quot;)
test &amp;lt;- sdf_bind_rows(partition_surv$test, partition_not$test) %&amp;gt;% ft_imputer(input_cols = &amp;quot;Age&amp;quot;, 
    output_cols = &amp;quot;Age&amp;quot;, strategy = &amp;quot;median&amp;quot;) %&amp;gt;% na.replace(Embarked = &amp;quot;S&amp;quot;) %&amp;gt;% 
    compute(&amp;quot;test&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that we use the &lt;strong&gt;compute&lt;/strong&gt; function to cache the output into Spark memory.&lt;/p&gt;
&lt;p&gt;Before fitting any model, the data must be processed into a form that the model can consume. Since our model, like most machine learning models, requires numeric features, we first convert the categorical variables to integers using the function &lt;strong&gt;ft_string_indexer&lt;/strong&gt;, and then convert them to dummy variables using the function &lt;strong&gt;ft_one_hot_encoder_estimator&lt;/strong&gt;, because the latter expects its inputs to be numeric.&lt;/p&gt;
&lt;p&gt;For models built in sparklyr, the input variables should be stacked into one column vector, which can easily be done with the function &lt;strong&gt;ft_vector_assembler&lt;/strong&gt;. This step does not prevent us from applying further transformations even though the features are in one column. For instance, to run our model efficiently we can transform the variables to be on the same scale; to do so we can use either standardization (as we do here) or normalization.&lt;/p&gt;
&lt;p&gt;It is good practice to save this processed set into Spark memory under an object name using the function &lt;strong&gt;compute&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trained &amp;lt;- train %&amp;gt;% ft_string_indexer(input_col = &amp;quot;Sex&amp;quot;, output_col = &amp;quot;Sex_indexed&amp;quot;) %&amp;gt;% 
    ft_string_indexer(input_col = &amp;quot;Embarked&amp;quot;, output_col = &amp;quot;Embarked_indexed&amp;quot;) %&amp;gt;% 
    ft_one_hot_encoder_estimator(input_cols = c(&amp;quot;Pclass&amp;quot;, &amp;quot;Sex_indexed&amp;quot;, &amp;quot;Embarked_indexed&amp;quot;), 
        output_cols = c(&amp;quot;Pc_encod&amp;quot;, &amp;quot;Sex_encod&amp;quot;, &amp;quot;Emb_encod&amp;quot;)) %&amp;gt;% ft_vector_assembler(input_cols = c(&amp;quot;Pc_encod&amp;quot;, 
    &amp;quot;Sex_encod&amp;quot;, &amp;quot;Age&amp;quot;, &amp;quot;SibSp&amp;quot;, &amp;quot;Parch&amp;quot;, &amp;quot;Fare&amp;quot;, &amp;quot;Emb_encod&amp;quot;), output_col = &amp;quot;features&amp;quot;) %&amp;gt;% 
    ft_standard_scaler(input_col = &amp;quot;features&amp;quot;, output_col = &amp;quot;scaled&amp;quot;, with_mean = TRUE) %&amp;gt;% 
    select(id, Survived, scaled) %&amp;gt;% compute(&amp;quot;trained&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The same transformations above will be applied to the testing set &lt;strong&gt;test&lt;/strong&gt; as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tested &amp;lt;- test %&amp;gt;% ft_string_indexer(input_col = &amp;quot;Sex&amp;quot;, output_col = &amp;quot;Sex_indexed&amp;quot;) %&amp;gt;% 
    ft_string_indexer(input_col = &amp;quot;Embarked&amp;quot;, output_col = &amp;quot;Embarked_indexed&amp;quot;) %&amp;gt;% 
    ft_one_hot_encoder_estimator(input_cols = c(&amp;quot;Pclass&amp;quot;, &amp;quot;Sex_indexed&amp;quot;, &amp;quot;Embarked_indexed&amp;quot;), 
        output_cols = c(&amp;quot;Pc_encod&amp;quot;, &amp;quot;Sex_encod&amp;quot;, &amp;quot;Emb_encod&amp;quot;)) %&amp;gt;% ft_vector_assembler(input_cols = c(&amp;quot;Pc_encod&amp;quot;, 
    &amp;quot;Sex_encod&amp;quot;, &amp;quot;Age&amp;quot;, &amp;quot;SibSp&amp;quot;, &amp;quot;Parch&amp;quot;, &amp;quot;Fare&amp;quot;, &amp;quot;Emb_encod&amp;quot;), output_col = &amp;quot;features&amp;quot;) %&amp;gt;% 
    ft_standard_scaler(input_col = &amp;quot;features&amp;quot;, output_col = &amp;quot;scaled&amp;quot;, with_mean = TRUE) %&amp;gt;% 
    select(id, Survived, scaled) %&amp;gt;% compute(&amp;quot;tested&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we are ready to project the data onto the lower-dimensional hashes using the function &lt;strong&gt;ft_bucketed_random_projection_lsh&lt;/strong&gt;, with a bucket length of 3 and 5 hash tables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lsh_vector &amp;lt;- ft_bucketed_random_projection_lsh(sc, input_col = &amp;quot;scaled&amp;quot;, output_col = &amp;quot;hash&amp;quot;, 
    bucket_length = 3, num_hash_tables = 5, seed = 444)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To fit this model we feed the function &lt;strong&gt;ml_fit&lt;/strong&gt; with the training data &lt;strong&gt;trained&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_lsh &amp;lt;- ml_fit(lsh_vector, trained)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;prediction&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Prediction&lt;/h1&gt;
&lt;p&gt;At the prediction stage, this classification model gives us two alternatives for defining the nearest neighbours:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Define a threshold value by which we decide whether two observations are considered nearest neighbours; a small value leads to a small number of neighbours. In sparklyr we can achieve this using the function &lt;strong&gt;ml_approx_similarity_join&lt;/strong&gt;, specifying the threshold value for the maximum distance. The distance used by this function is the classical Euclidean distance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prespecify the number of nearest neighbours, regardless of the distance between observations. This second alternative can be achieved using &lt;strong&gt;ml_approx_nearest_neighbors&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each has its advantages and drawbacks, depending on the problem at hand. For instance, in medicine, if you are mainly interested in checking the similarities among patients at some level, then the first option would be your choice, but you may not be able to predict new cases that are not similar to any of the training cases within the threshold value. In contrast, if your goal is to predict all your new cases, then you would opt for the second option, but at the cost of including neighbours that are far away, forced in by the fixed number of neighbours.
To better understand what happens with each option, let’s use the following data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(plotrix))
X &amp;lt;- c(55, 31, 35, 34, 15, 28, 8, 38, 35, 19, 27, 40, 39, 19, 66, 28, 42, 21, 18, 
    14, 40, 27, 3, 19, 21, 32, 13, 18, 7, 21, 49)
Y &amp;lt;- c(16, 18, 26, 13, 8.0292, 35.5, 21.075, 31.3875, 7.225, 263, 7.8958, 27.7208, 
    146.5208, 7.75, 10.5, 82.1708, 52, 8.05, 18, 11.2417, 9.475, 21, 41.5792, 7.8792, 
    8.05, 15.5, 7.75, 17.8, 39.6875, 7.8, 76.7292)
Z &amp;lt;- factor(c(1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 
    1, 0, 0, 1, 0, 0, 0, 1))
plot(X, Y, col = Z, ylim = c(0, 55), pch = 16)
points(x = 32, y = 20, col = &amp;quot;blue&amp;quot;, pch = 8)
draw.circle(x = 32, y = 20, nv = 1000, radius = 6, lty = 2)
points(x = 55, y = 42, col = &amp;quot;green&amp;quot;, pch = 8)
draw.circle(x = 55, y = 42, nv = 1000, radius = 6, lty = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/sparklyr/lsh_spark/2020-04-28-local-snsitivity-hashing-model_files/figure-html/unnamed-chunk-10-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We use the fake data above to illustrate the difference between the two methods.
Setting the threshold at 6 for the first option, we see that the blue dot has 5 neighbours, and this dot would be predicted as black by majority vote. However, with this threshold the green dot does not have any neighbours around it and hence will be left without a prediction.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(X, Y, pch = 16, col = Z, ylim = c(0, 55))
points(x = 32, y = 20, col = &amp;quot;blue&amp;quot;, pch = 8)
points(x = 55, y = 42, col = &amp;quot;green&amp;quot;, pch = 8)
draw.circle(x = 55, y = 42, nv = 1000, radius = 21.8, lty = 2)
draw.circle(x = 32, y = 20, nv = 1000, radius = 6, lty = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/sparklyr/lsh_spark/2020-04-28-local-snsitivity-hashing-model_files/figure-html/unnamed-chunk-11-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In contrast to the above plot, using the second option the green dot can be predicted as black since, of its 5 neighbours, 3 are black; but this prediction is of doubtful quality since all the neighbours are far away from the dot of interest, and this is the major drawback of this method.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: In fact, we can overcome the drawbacks of each method by tuning the hyperparameters. With the first method we can increase the distance threshold so that all new cases will be predicted (but we may lose accuracy if we have even a single outlier). With the second we can reduce the number of nearest neighbours to get meaningful similarities (but we may lose accuracy when the dots are spread out from each other).&lt;/p&gt;
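&lt;p&gt;The effect of the threshold can also be sketched numerically (plain R, with made-up points): counting how many neighbours fall within different radii of a query point shows the trade-off between leaving a point without neighbours and admitting distant ones.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# made-up 2-d points and a query point, for illustration only
pts &amp;lt;- cbind(c(31, 35, 34, 28, 38, 27, 55), c(18, 26, 13, 35.5, 31.4, 21, 16))
query &amp;lt;- c(32, 20)
d &amp;lt;- sqrt(colSums((t(pts) - query)^2))            # euclidean distances to the query
sapply(c(3, 6, 12, 25), function(r) sum(d &amp;lt;= r))  # neighbour count per radius: 1 2 4 7&lt;/code&gt;&lt;/pre&gt;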
&lt;div id=&#34;similarity-based-on-distance&#34; class=&#34;section level2&#34; number=&#34;3.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.1&lt;/span&gt; Similarity based on distance&lt;/h2&gt;
&lt;p&gt;To show the neighbours of each point we use the function
&lt;strong&gt;ml_approx_similarity_join&lt;/strong&gt;, provided that the data has an &lt;strong&gt;id&lt;/strong&gt; column; this is the reason why we created this id earlier.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;approx_join &amp;lt;- ml_approx_similarity_join(model_lsh, trained, trained, threshold = 1, 
    dist_col = &amp;quot;dist&amp;quot;)
approx_join&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Source: spark&amp;lt;?&amp;gt; [?? x 3]
    id_a  id_b      dist
   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;     &amp;lt;dbl&amp;gt;
 1     2   376 0.813    
 2    11    11 0        
 3    16   773 0.189    
 4    16   707 0.787    
 5    20   368 0.0000809
 6    23   290 0.550    
 7    23   157 0.0787   
 8    24   873 0.707    
 9    24   448 0.502    
10    24    84 0.224    
# ... with more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This function joins the data &lt;strong&gt;trained&lt;/strong&gt; with itself to get the similar observations. The threshold determines the value below which we consider two observations similar; in other words, pairs with a dist value less than 1 are treated as similar. Let’s, for instance, pick some similar observations and check how they are similar.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train %&amp;gt;% filter(id %in% c(29, 654, 275, 199, 45))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Source: spark&amp;lt;?&amp;gt; [?? x 9]
     id Survived Pclass Sex      Age SibSp Parch  Fare Embarked
  &amp;lt;int&amp;gt;    &amp;lt;int&amp;gt;  &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;   
1    29        1      3 female    28     0     0  7.88 Q       
2    45        1      3 female    19     0     0  7.88 Q       
3   199        1      3 female    28     0     0  7.75 Q       
4   275        1      3 female    28     0     0  7.75 Q       
5   654        1      3 female    28     0     0  7.83 Q       &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, all these passengers are surviving females in the same class (third class), without children, parents or siblings, who embarked from the same port, paid approximately the same ticket price, and have the same age (except for 45, aged 19), so they are highly likely to be friends traveling together.
To predict the test set &lt;strong&gt;tested&lt;/strong&gt; we use the function &lt;strong&gt;ml_predict&lt;/strong&gt;, then we extract the similarities with the function &lt;strong&gt;ml_approx_similarity_join&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;hashed &amp;lt;- ml_predict(model_lsh, tested) %&amp;gt;% ml_approx_similarity_join(model_lsh, 
    trained, ., threshold = 1, dist_col = &amp;quot;dist&amp;quot;)
hashed&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Source: spark&amp;lt;?&amp;gt; [?? x 3]
    id_a  id_b  dist
   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt;
 1    12   863 0.904
 2    16   459 0.557
 3    29   728 0.266
 4    29    33 0.265
 5    37   245 0.479
 6    45    33 0.788
 7    48   728 0.265
 8    48   369 0.265
 9    48   187 0.848
10    54   519 0.564
# ... with more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can now choose a particular person, say id_b = 33, and then find his/her similar persons in the training set. By majority vote we decide whether that person survived or not.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;m &amp;lt;- 33
ids_train &amp;lt;- hashed %&amp;gt;% filter(id_b == m) %&amp;gt;% pull(id_a)
df1 &amp;lt;- train %&amp;gt;% filter(id %in% ids_train)
df2 &amp;lt;- test %&amp;gt;% filter(id == m)
df &amp;lt;- sdf_bind_rows(df1, df2)
df&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Source: spark&amp;lt;?&amp;gt; [?? x 9]
      id Survived Pclass Sex      Age SibSp Parch  Fare Embarked
   &amp;lt;int&amp;gt;    &amp;lt;int&amp;gt;  &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;   
 1    29        1      3 female    28     0     0  7.88 Q       
 2    45        1      3 female    19     0     0  7.88 Q       
 3    48        1      3 female    28     0     0  7.75 Q       
 4    83        1      3 female    28     0     0  7.79 Q       
 5   199        1      3 female    28     0     0  7.75 Q       
 6   275        1      3 female    28     0     0  7.75 Q       
 7   290        1      3 female    22     0     0  7.75 Q       
 8   301        1      3 female    28     0     0  7.75 Q       
 9   360        1      3 female    28     0     0  7.88 Q       
10   574        1      3 female    28     0     0  7.75 Q       
# ... with more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The last row in this table contains our test instance 33, which has 17 neighbours from the training data, a mixture of passengers who died and who survived.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df %&amp;gt;% filter(id != m) %&amp;gt;% select(Survived) %&amp;gt;% collect() %&amp;gt;% table()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## .
##  0  1 
##  5 12&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By majority vote this person will be classified as survived, since the number of non-survivors (5) is less than the number of survivors (12); hence this person is correctly classified.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;similarity-based-on-the-number-of-nearest-neighbours&#34; class=&#34;section level2&#34; number=&#34;3.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.2&lt;/span&gt; Similarity based on the number of nearest neighbours&lt;/h2&gt;
&lt;p&gt;Using the same steps as above, but here with the function &lt;strong&gt;ml_approx_nearest_neighbors&lt;/strong&gt;, we can predict any point. For example, let’s take our previous passenger 33 from the testing set. But first we have to extract the values related to this person from the transformed testing set &lt;strong&gt;tested&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;id_input &amp;lt;- tested %&amp;gt;% filter(id == m) %&amp;gt;% pull(scaled) %&amp;gt;% unlist()
id_input&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1]  0.00000000 -0.56054485 -0.50652969 -1.42132034 -0.07744921 -0.49874843
##  [7] -0.47508853 -0.54740838 -1.79973402 -0.41903250&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These are the values of the standardized vector in the column &lt;strong&gt;scaled&lt;/strong&gt; that will be used to find its closest neighbours in the training data; here we specify the number of neighbours as 7.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn &amp;lt;- ml_approx_nearest_neighbors(model_lsh, trained, key = id_input, dist_col = &amp;quot;dist&amp;quot;, 
    num_nearest_neighbors = 7)
knn&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Source: spark&amp;lt;?&amp;gt; [?? x 5]
     id Survived scaled     hash        dist
  &amp;lt;int&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;list&amp;gt;     &amp;lt;list&amp;gt;     &amp;lt;dbl&amp;gt;
1   698        1 &amp;lt;dbl [10]&amp;gt; &amp;lt;list [5]&amp;gt; 0.265
2    48        1 &amp;lt;dbl [10]&amp;gt; &amp;lt;list [5]&amp;gt; 0.265
3   275        1 &amp;lt;dbl [10]&amp;gt; &amp;lt;list [5]&amp;gt; 0.265
4   199        1 &amp;lt;dbl [10]&amp;gt; &amp;lt;list [5]&amp;gt; 0.265
5   301        1 &amp;lt;dbl [10]&amp;gt; &amp;lt;list [5]&amp;gt; 0.265
6   574        1 &amp;lt;dbl [10]&amp;gt; &amp;lt;list [5]&amp;gt; 0.265
7   265        0 &amp;lt;dbl [10]&amp;gt; &amp;lt;list [5]&amp;gt; 0.265&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These are the neighbours of our passenger, with their id’s. We can get the fraction of survivors as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;n &amp;lt;- sdf_nrow(knn)
pred &amp;lt;- knn %&amp;gt;% select(Survived) %&amp;gt;% summarise(p = sum(Survived)/n)
pred&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 1]
##       p
##   &amp;lt;dbl&amp;gt;
## 1 0.857&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since this probability is greater than 0.5, we predict that this passenger survived, and here too the passenger is correctly classified. However, in some cases we may get different predictions.&lt;/p&gt;
&lt;p&gt;To get the accuracy over the whole testing set, we use the following for loop, which requires a lot of computing time since at the end of each iteration we collect the results into R. Consequently, it will not be useful with a large data set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mypred &amp;lt;- numeric(0)
M &amp;lt;- tested %&amp;gt;% collect() %&amp;gt;% .$id
for (i in M) {
    id_input &amp;lt;- tested %&amp;gt;% filter(id == i) %&amp;gt;% pull(scaled) %&amp;gt;% unlist()
    knn &amp;lt;- ml_approx_nearest_neighbors(model_lsh, trained, key = id_input, dist_col = &amp;quot;dist&amp;quot;, 
        num_nearest_neighbors = 7)
    n &amp;lt;- sdf_nrow(knn)
    pred &amp;lt;- knn %&amp;gt;% select(Survived) %&amp;gt;% summarise(p = sum(Survived)/n) %&amp;gt;% collect()
    mypred &amp;lt;- rbind(mypred, pred)
}
mypred&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 200 x 1
       p
   &amp;lt;dbl&amp;gt;
 1 0.286
 2 1    
 3 0.571
 4 0    
 5 0.143
 6 0.857
 7 0.429
 8 1    
 9 1    
10 0.857
# ... with 190 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we first convert the probabilities into class labels, then join this data frame with the testing data, and finally use the function &lt;strong&gt;confusionMatrix&lt;/strong&gt; from the &lt;strong&gt;caret&lt;/strong&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tested_R &amp;lt;- tested %&amp;gt;% select(Survived) %&amp;gt;% collect()
new &amp;lt;- cbind(mypred, tested_R) %&amp;gt;% mutate(predicted = ifelse(p &amp;gt; 0.5, &amp;quot;1&amp;quot;, &amp;quot;0&amp;quot;))
caret::confusionMatrix(as.factor(new$Survived), as.factor(new$predicted))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 109  12
         1  30  49
                                          
               Accuracy : 0.79            
                 95% CI : (0.7269, 0.8443)
    No Information Rate : 0.695           
    P-Value [Acc &amp;gt; NIR] : 0.001704        
                                          
                  Kappa : 0.5425          
                                          
 Mcnemar&amp;#39;s Test P-Value : 0.008712        
                                          
            Sensitivity : 0.7842          
            Specificity : 0.8033          
         Pos Pred Value : 0.9008          
         Neg Pred Value : 0.6203          
             Prevalence : 0.6950          
         Detection Rate : 0.5450          
   Detection Prevalence : 0.6050          
      Balanced Accuracy : 0.7937          
                                          
       &amp;#39;Positive&amp;#39; Class : 0               
                                          &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The accuracy rate is pretty good, at 79%.&lt;/p&gt;
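&lt;p&gt;As a side note, if we only need the overall accuracy rate itself, it can also be computed directly without &lt;strong&gt;caret&lt;/strong&gt; (a quick sketch using the &lt;code&gt;new&lt;/code&gt; data frame built above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# proportion of predicted labels matching the observed labels
mean(new$predicted == as.character(new$Survived))&lt;/code&gt;&lt;/pre&gt;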
&lt;p&gt;Finally, do not forget to disconnect when your work is completed.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;spark_disconnect(sc)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Conclusion&lt;/h1&gt;
&lt;p&gt;The LSH model is an approximation of knn for large datasets. We could improve the model performance by tuning the threshold value or the number of neighbours.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;further-reading&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Further reading&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://therinspark.com&#34; class=&#34;uri&#34;&gt;https://therinspark.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing&#34; class=&#34;uri&#34;&gt;https://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;session-information&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Session information&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## R version 4.0.1 (2020-06-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] plotrix_3.7-8   forcats_0.5.0   stringr_1.4.0   dplyr_1.0.2    
##  [5] purrr_0.3.4     readr_1.3.1     tidyr_1.1.2     tibble_3.0.3   
##  [9] ggplot2_3.3.2   tidyverse_1.3.0 sparklyr_1.4.0 
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-149         fs_1.5.0             lubridate_1.7.9     
##  [4] httr_1.4.2           rprojroot_1.3-2      tools_4.0.1         
##  [7] backports_1.1.10     utf8_1.1.4           R6_2.4.1            
## [10] rpart_4.1-15         DBI_1.1.0            colorspace_1.4-1    
## [13] nnet_7.3-14          withr_2.3.0          tidyselect_1.1.0    
## [16] compiler_4.0.1       cli_2.0.2            rvest_0.3.6         
## [19] formatR_1.7          forge_0.2.0          xml2_1.3.2          
## [22] bookdown_0.20        scales_1.1.1         askpass_1.1         
## [25] digest_0.6.25        rmarkdown_2.4        base64enc_0.1-3     
## [28] pkgconfig_2.0.3      htmltools_0.5.0      dbplyr_1.4.4        
## [31] htmlwidgets_1.5.2    rlang_0.4.7          readxl_1.3.1        
## [34] rstudioapi_0.11      generics_0.0.2       jsonlite_1.7.1      
## [37] ModelMetrics_1.2.2.2 config_0.3           magrittr_1.5        
## [40] Matrix_1.2-18        Rcpp_1.0.5           munsell_0.5.0       
## [43] fansi_0.4.1          lifecycle_0.2.0      pROC_1.16.2         
## [46] stringi_1.5.3        yaml_2.2.1           MASS_7.3-53         
## [49] plyr_1.8.6           recipes_0.1.13       grid_4.0.1          
## [52] blob_1.2.1           parallel_4.0.1       crayon_1.3.4        
## [55] lattice_0.20-41      haven_2.3.1          splines_4.0.1       
## [58] hms_0.5.3            knitr_1.30           pillar_1.4.6        
## [61] uuid_0.1-4           stats4_4.0.1         reshape2_1.4.4      
## [64] codetools_0.2-16     reprex_0.3.0         glue_1.4.2          
## [67] evaluate_0.14        blogdown_0.20        data.table_1.13.0   
## [70] modelr_0.1.8         vctrs_0.3.4          foreach_1.5.0       
## [73] cellranger_1.1.0     gtable_0.3.0         openssl_1.4.3       
## [76] assertthat_0.2.1     r2d3_0.2.3           xfun_0.18           
## [79] gower_0.2.2          prodlim_2019.11.13   broom_0.7.1         
## [82] e1071_1.7-3          class_7.3-17         survival_3.2-7      
## [85] timeDate_3043.102    iterators_1.0.12     lava_1.6.8          
## [88] ellipsis_0.3.1       caret_6.0-86         ipred_0.9-9&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Bayesian linear regression</title>
      <link>https://modelingwithr.rbind.io/bayes/bayesian-linear-regression/</link>
      <pubDate>Sat, 25 Apr 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/bayes/bayesian-linear-regression/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#classical-linear-regression-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Classical linear regression model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bayesian-regression&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Bayesian regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bayesian-inferences&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Bayesian inferences&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#pd-and-p-value&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; PD and P-value&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;For statistical inference we have two general approaches or frameworks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Frequentist&lt;/strong&gt; approach, in which the data sampled from the population is considered random, while the population parameter values, stated in a null hypothesis, are fixed (but unknown). To test this null hypothesis, we look for the sample parameters that maximize the likelihood of the data. The data at hand, however, even though it was sampled randomly from the population, is now fixed, so in what sense can we treat it as random? The answer is that we assume the population distribution is known and work out the likelihood of the data under this distribution, or we imagine repeating the study many times with different samples and averaging the results. If the probability of obtaining data like ours under the null hypothesis, known as the &lt;strong&gt;p-value&lt;/strong&gt;, is very small, we tend to reject the null hypothesis.
The main problem, however, is the misunderstanding and misuse of this p-value when we decide to reject the null hypothesis based on some threshold, wrongly interpreting it as the probability that the null hypothesis is true. For more detail about the p-value, click &lt;a href=&#34;http://www.statlit.org/pdf/2016-Neath-ASA.pdf&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bayesian&lt;/strong&gt; approach, in contrast, provides true probabilities to quantify the uncertainty about a certain hypothesis, but it requires a first belief about how likely this hypothesis is, known as the &lt;strong&gt;prior&lt;/strong&gt;, in order to derive the probability of this hypothesis after seeing the data, known as the &lt;strong&gt;posterior probability&lt;/strong&gt;. This approach is called Bayesian because it is based on 
&lt;a href=&#34;https://www.probabilisticworld.com/what-is-bayes-theorem/&#34;&gt;bayes’ theorem&lt;/a&gt;: for instance, if we have a population parameter &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; to estimate, and some data &lt;span class=&#34;math inline&#34;&gt;\(D\)&lt;/span&gt; sampled randomly from this population, the posterior probability will be &lt;span class=&#34;math display&#34;&gt;\[\overbrace{p(\theta/D)}^{Posterior}=\frac{\overbrace{p(D/\theta)}^{Likelihood}.\overbrace{p(\theta)}^{Prior}}{\underbrace{p(D)}_{Evidence}}\]&lt;/span&gt;
The &lt;strong&gt;Evidence&lt;/strong&gt; is the probability of the data at hand regardless of the parameter &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
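&lt;p&gt;As a quick numeric illustration of the theorem (with made-up numbers), suppose a disease affects 1% of a population, and a test detects it with probability 0.95 but also gives a false positive with probability 0.05; the posterior probability of having the disease given a positive test is then:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prior &amp;lt;- 0.01  # p(theta): prevalence of the disease
likelihood &amp;lt;- 0.95  # p(D/theta): probability of a positive test when sick
evidence &amp;lt;- 0.95 * 0.01 + 0.05 * 0.99  # p(D): total probability of a positive test
likelihood * prior/evidence  # p(theta/D)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.1610169&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Despite the accurate test, the posterior is only about 16% because the prior is so low; this updating of a prior belief by the data is exactly what the Bayesian approach formalizes.&lt;/p&gt;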
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;For simplicity we use the &lt;strong&gt;BostonHousing&lt;/strong&gt; data from the &lt;strong&gt;mlbench&lt;/strong&gt; package; for more detail about this data, run the command &lt;code&gt;?BostonHousing&lt;/code&gt; after loading the package. But first, let’s call all the packages that we need throughout this article.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;options(warn = -1)
library(mlbench)
library(rstanarm)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Loading required package: Rcpp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## This is rstanarm version 2.21.1&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## - Default priors may change, so it&amp;#39;s safest to specify priors, even if equivalent to the defaults.&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## - For execution on a local, multicore CPU with excess RAM we recommend calling&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   options(mc.cores = parallel::detectCores())&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(bayestestR)
library(bayesplot)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## This is bayesplot version 1.7.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## - Online documentation and vignettes at mc-stan.org/bayesplot&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## - bayesplot theme set to bayesplot::theme_default()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    * Does _not_ affect other ggplot2 plots&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    * See ?bayesplot_theme_set for details on theme setting&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(insight)
library(broom)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(&amp;quot;BostonHousing&amp;quot;)
str(BostonHousing)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;#39;data.frame&amp;#39;:    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : Factor w/ 2 levels &amp;quot;0&amp;quot;,&amp;quot;1&amp;quot;: 1 1 1 1 1 1 1 1 1 1 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : num  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ b      : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To better understand how the Bayesian regression works, we keep only three features: two numeric variables, &lt;strong&gt;age&lt;/strong&gt; and &lt;strong&gt;dis&lt;/strong&gt;, and one categorical, &lt;strong&gt;chas&lt;/strong&gt;, along with the target variable &lt;strong&gt;medv&lt;/strong&gt;, the median value of owner-occupied homes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bost &amp;lt;- BostonHousing[, c(&amp;quot;medv&amp;quot;, &amp;quot;age&amp;quot;, &amp;quot;dis&amp;quot;, &amp;quot;chas&amp;quot;)]
summary(bost)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       medv            age              dis         chas   
##  Min.   : 5.00   Min.   :  2.90   Min.   : 1.130   0:471  
##  1st Qu.:17.02   1st Qu.: 45.02   1st Qu.: 2.100   1: 35  
##  Median :21.20   Median : 77.50   Median : 3.207          
##  Mean   :22.53   Mean   : 68.57   Mean   : 3.795          
##  3rd Qu.:25.00   3rd Qu.: 94.08   3rd Qu.: 5.188          
##  Max.   :50.00   Max.   :100.00   Max.   :12.127&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From the summary we do not see any particular issues, such as missing values.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;classical-linear-regression-model&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Classical linear regression model&lt;/h1&gt;
&lt;p&gt;To highlight the difference between Bayesian regression and traditional linear regression (the frequentist approach), let’s first fit the latter to our data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_freq &amp;lt;- lm(medv ~ ., data = bost)
tidy(model_freq)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 4 x 5
##   term        estimate std.error statistic  p.value
##   &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
## 1 (Intercept)   32.7      2.25      14.6   2.33e-40
## 2 age           -0.143    0.0198    -7.21  2.09e-12
## 3 dis           -0.246    0.265     -0.928 3.54e- 1
## 4 chas1          7.51     1.46       5.13  4.16e- 7&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Judging by the p.value of each regressor, all the regressors are significant except for the &lt;strong&gt;dis&lt;/strong&gt; variable. Since the variable &lt;strong&gt;chas&lt;/strong&gt; is categorical with two levels, the coefficient of &lt;strong&gt;chas1&lt;/strong&gt; is the difference between the median price of houses that bound the Charles River and that of the others, so the median price of the former is higher by about 7.513.&lt;/p&gt;
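&lt;p&gt;For later comparison with the Bayesian credible intervals, we can also extract the classical 95% confidence intervals of these coefficients with &lt;code&gt;confint&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# frequentist 95% confidence intervals for the lm coefficients
confint(model_freq)&lt;/code&gt;&lt;/pre&gt;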
&lt;/div&gt;
&lt;div id=&#34;bayesian-regression&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Bayesian regression&lt;/h1&gt;
&lt;p&gt;To fit a Bayesian regression we use the function &lt;code&gt;stan_glm&lt;/code&gt; from the &lt;a href=&#34;https://cran.r-project.org/web/packages/rstanarm/rstanarm.pdf&#34;&gt;rstanarm&lt;/a&gt; package. This function, like the &lt;strong&gt;lm&lt;/strong&gt; function above, requires providing the &lt;strong&gt;formula&lt;/strong&gt; and the data to be used; we leave all the following arguments at their default values:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;family&lt;/strong&gt; : by default this function uses the &lt;strong&gt;gaussian&lt;/strong&gt; distribution, as we do with the classical &lt;code&gt;glm&lt;/code&gt; function to fit an &lt;code&gt;lm&lt;/code&gt; model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;prior&lt;/strong&gt; : the prior distribution for the regression coefficients; by default the normal prior is used. rstanarm provides a set of functions for the prior, such as the &lt;strong&gt;student t family&lt;/strong&gt; and the &lt;strong&gt;laplace family&lt;/strong&gt;; to get the full list with all the details, run the command &lt;code&gt;?priors&lt;/code&gt;. If we want a flat uniform prior we set this to &lt;strong&gt;NULL&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;prior_intercept&lt;/strong&gt;: prior for the intercept; it can be normal, student_t, or cauchy. If we want a flat uniform prior we set this to &lt;strong&gt;NULL&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;prior_aux&lt;/strong&gt;: prior for auxiliary parameters, such as the error standard deviation for the gaussian family.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;algorithm&lt;/strong&gt;: the estimating approach to use. The default is &amp;quot;sampling&amp;quot;, i.e. MCMC&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;QR&lt;/strong&gt;: FALSE by default; if TRUE, a QR decomposition is applied to the design matrix, which is useful when we have a large number of predictors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;iter&lt;/strong&gt; : the number of iterations when the MCMC method is used; the default is 2000.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;chains&lt;/strong&gt; : the number of Markov chains; the default is 4.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;warmup&lt;/strong&gt; : also known as burn-in, the number of iterations used for adaptation, which should not be used for inference. By default it is half of the iterations.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_bayes &amp;lt;- stan_glm(medv ~ ., data = bost, seed = 111)&lt;/code&gt;&lt;/pre&gt;
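&lt;p&gt;Since rstanarm itself warns that the default priors may change, it is safer to state them explicitly. As a sketch (the model name &lt;code&gt;model_bayes2&lt;/code&gt; is ours, and the values are only similar in spirit to the current defaults), the following call spells out weakly informative priors: normal priors for the coefficients and the intercept, and an exponential prior for the error standard deviation:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# same model, but with the priors written out explicitly
model_bayes2 &amp;lt;- stan_glm(medv ~ ., data = bost, family = gaussian(),
    prior = normal(location = 0, scale = 2.5, autoscale = TRUE),
    prior_intercept = normal(location = 0, scale = 2.5, autoscale = TRUE),
    prior_aux = exponential(rate = 1, autoscale = TRUE),
    chains = 4, iter = 2000, seed = 111)&lt;/code&gt;&lt;/pre&gt;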
&lt;p&gt;If we print the model, we get the following:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;print(model_bayes, digits = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## stan_glm
##  family:       gaussian [identity]
##  formula:      medv ~ .
##  observations: 506
##  predictors:   4
## ------
##             Median MAD_SD
## (Intercept) 32.834  2.285
## age         -0.143  0.020
## dis         -0.258  0.257
## chas1        7.543  1.432
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 8.324  0.260 
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;strong&gt;Median&lt;/strong&gt; estimate is the median computed from the MCMC simulation, and &lt;strong&gt;MAD_SD&lt;/strong&gt; is the median absolute deviation computed from the same simulation. To better understand how these outputs are obtained, let’s plot the MCMC simulation of each predictor using 
&lt;a href=&#34;https://cran.r-project.org/web/packages/bayesplot/bayesplot.pdf&#34;&gt;bayesplot&lt;/a&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mcmc_dens(model_bayes, pars = c(&amp;quot;age&amp;quot;)) + vline_at(-0.143, col = &amp;quot;red&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/bayes/2020-04-25-bayesian-linear-regression_files/figure-html/unnamed-chunk-8-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As you see, the point estimate of &lt;strong&gt;age&lt;/strong&gt; falls on the median of this distribution (red line). The same is true for the &lt;strong&gt;dis&lt;/strong&gt; and &lt;strong&gt;chas&lt;/strong&gt; predictors.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mcmc_dens(model_bayes, pars = c(&amp;quot;chas1&amp;quot;)) + vline_at(7.496, col = &amp;quot;red&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/bayes/2020-04-25-bayesian-linear-regression_files/figure-html/unnamed-chunk-9-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mcmc_dens(model_bayes, pars = c(&amp;quot;dis&amp;quot;)) + vline_at(-0.244, col = &amp;quot;red&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/bayes/2020-04-25-bayesian-linear-regression_files/figure-html/unnamed-chunk-10-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now how can we evaluate the model parameters? The answer is by analyzing the posteriors using some specific statistics. To get the full statistics provided by &lt;a href=&#34;https://cran.r-project.org/web/packages/bayestestR/bayestestR.pdf&#34;&gt;bayestestR&lt;/a&gt; package, we make use of the function &lt;code&gt;describe_posterior&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;describe_posterior(model_bayes)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Possible multicollinearity between dis and age (r = 0.76). This might lead to inappropriate results. See &amp;#39;Details&amp;#39; in &amp;#39;?rope&amp;#39;.&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Description of Posterior Distributions
## 
## Parameter   | Median |           89% CI |    pd |        89% ROPE | % in ROPE |  Rhat |      ESS
## ------------------------------------------------------------------------------------------------
## (Intercept) | 32.834 | [29.218, 36.295] | 1.000 | [-0.920, 0.920] |         0 | 1.002 | 2029.279
## age         | -0.143 | [-0.175, -0.112] | 1.000 | [-0.920, 0.920] |       100 | 1.001 | 2052.155
## dis         | -0.258 | [-0.667,  0.179] | 0.819 | [-0.920, 0.920] |       100 | 1.002 | 2115.192
## chas1       |  7.543 | [ 5.159,  9.813] | 1.000 | [-0.920, 0.920] |         0 | 1.000 | 3744.403&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before starting to analyze the table, we should first understand the various statistics above, which are commonly used in Bayesian regression.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CI&lt;/strong&gt; : &lt;a href=&#34;https://freakonometrics.hypotheses.org/18117&#34;&gt;Credible Interval&lt;/a&gt;, used to quantify the uncertainty about the regression coefficients. There are two methods to compute the &lt;strong&gt;CI&lt;/strong&gt;: the 
&lt;a href=&#34;https://www.sciencedirect.com/topics/mathematics/highest-density-interval&#34;&gt;highest density interval&lt;/a&gt; &lt;code&gt;HDI&lt;/code&gt;, which is the default, and the 
&lt;a href=&#34;https://www.sciencedirect.com/topics/mathematics/credible-interval&#34;&gt;Equal-tailed Interval&lt;/a&gt; &lt;code&gt;ETI&lt;/code&gt;. With 89% probability (given the data), a coefficient lies above the &lt;strong&gt;CI_low&lt;/strong&gt; value and below the &lt;strong&gt;CI_high&lt;/strong&gt; value. This straightforward probabilistic interpretation is completely different from that of the confidence interval in classical linear regression, where the interval would contain the coefficient (if we choose a 95% confidence level) 95 times out of 100 if we repeated the study 100 times.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;pd&lt;/strong&gt; : &lt;a href=&#34;https://www.r-bloggers.com/the-p-direction-a-bayesian-equivalent-of-the-p-value/&#34;&gt;Probability of Direction&lt;/a&gt;, which is the probability that the effect goes in the positive or the negative direction; it is considered the best equivalent of the p-value.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ROPE_CI&lt;/strong&gt;: &lt;a href=&#34;https://cran.r-project.org/web/packages/bayestestR/vignettes/region_of_practical_equivalence.html&#34;&gt;Region of Practical Equivalence&lt;/a&gt;; since the Bayesian method deals with true probabilities, it does not make sense to compute the probability that the effect equals zero (the null hypothesis) exactly, because the probability of a single point in a continuous distribution is zero. Thus, we instead define a small range around zero that can be considered practically equivalent to no effect; this range is called the &lt;strong&gt;ROPE&lt;/strong&gt;. By default (following Cohen, 1988), the ROPE is [-0.1, 0.1] on the scale of the standardized coefficients.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rhat&lt;/strong&gt;: &lt;a href=&#34;https://arxiv.org/pdf/1903.08008.pdf&#34;&gt;scale reduction factor &lt;span class=&#34;math inline&#34;&gt;\(\hat R\)&lt;/span&gt;&lt;/a&gt;, it is computed for each scalar quantity of interest, as the standard deviation of that quantity from all the chains included together, divided by the root mean square of the separate within-chain standard deviations. When this value is close to 1 we do not have any convergence problem with MCMC.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ESS&lt;/strong&gt; : &lt;a href=&#34;https://arxiv.org/pdf/1903.08008.pdf&#34;&gt;effective sample size&lt;/a&gt;, it captures how many independent draws contain the same amount of information as the dependent sample obtained by the MCMC algorithm, the higher the ESS the better. The threshold used in practice is 400.&lt;/li&gt;
&lt;/ul&gt;
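&lt;p&gt;To make these statistics concrete, here is a sketch of how &lt;strong&gt;pd&lt;/strong&gt; and the proportion of draws inside a given &lt;strong&gt;ROPE&lt;/strong&gt; could be computed by hand from the posterior draws of a coefficient (using the [-0.92, 0.92] ROPE range reported in the table above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;draws &amp;lt;- as.data.frame(model_bayes)$age  # posterior draws of the age coefficient
max(mean(draws &amp;gt; 0), mean(draws &amp;lt; 0))  # pd: share of draws on the dominant side
mean(draws &amp;gt;= -0.92 &amp;amp; draws &amp;lt;= 0.92)  # share of draws inside the ROPE&lt;/code&gt;&lt;/pre&gt;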
&lt;p&gt;Alternatively, we can get the coefficient estimates (which are the medians by default) separately by using the package &lt;strong&gt;insight&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;post &amp;lt;- get_parameters(model_bayes)
print(purrr::map_dbl(post, median), digits = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## (Intercept)         age         dis       chas1 
##      32.834      -0.143      -0.258       7.543&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also compute the maximum a posteriori (MAP) estimate and the mean as follows&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;print(purrr::map_dbl(post, map_estimate), digits = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## (Intercept)         age         dis       chas1 
##      33.025      -0.145      -0.295       7.573&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;print(purrr::map_dbl(post, mean), digits = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## (Intercept)         age         dis       chas1 
##      32.761      -0.143      -0.248       7.523&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, the values are close to each other, due to the near-normality of the posterior distributions, for which all the central statistics (mean, median, mode) roughly coincide.
The following plot visualizes the age coefficient using these different statistics.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mcmc_dens(model_bayes, pars = c(&amp;quot;age&amp;quot;)) + vline_at(median(post$age), col = &amp;quot;red&amp;quot;) + 
    vline_at(mean(post$age), col = &amp;quot;yellow&amp;quot;) + vline_at(map_estimate(post$age), col = &amp;quot;green&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/bayes/2020-04-25-bayesian-linear-regression_files/figure-html/unnamed-chunk-14-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As expected they are approximately on top of each other.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;bayesian-inferences&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Bayesian inferences&lt;/h1&gt;
&lt;p&gt;As with classical (frequentist) regression, we can test the significance of the Bayesian regression coefficients by checking whether the corresponding credible interval contains zero; if it does not, the coefficient is significant. Let’s go back to our model and check the significance of each coefficient (using the credible interval based on the default &lt;code&gt;hdi&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;hdi(model_bayes)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Highest Density Interval
## 
## Parameter   |        89% HDI
## ----------------------------
## (Intercept) | [29.22, 36.29]
## age         | [-0.18, -0.11]
## dis         | [-0.67,  0.18]
## chas1       | [ 5.16,  9.81]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And based on the &lt;code&gt;eti&lt;/code&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;eti(model_bayes)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Equal-Tailed Interval
## 
## Parameter   |        89% ETI
## ----------------------------
## (Intercept) | [29.20, 36.28]
## age         | [-0.17, -0.11]
## dis         | [-0.67,  0.18]
## chas1       | [ 5.17,  9.83]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using both methods, the only non-significant coefficient is that of the &lt;strong&gt;dis&lt;/strong&gt; variable, which is in line with the classical regression.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: this agreement between the frequentist and Bayesian regressions may be due to the normality assumption of the former being well satisfied here, and to the normal prior used in the latter. In the real world, however, we are less often sure about the normality assumption, which may lead to contradictory conclusions between the two approaches.&lt;/p&gt;
&lt;p&gt;Another way to test significance is to check the part of the credible interval that falls inside the ROPE interval. We can get this by calling the &lt;code&gt;rope&lt;/code&gt; function from the &lt;strong&gt;bayestestR&lt;/strong&gt; package&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rope(post$age)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Proportion of samples inside the ROPE [-0.10, 0.10]:
## 
## inside ROPE
## -----------
## 0.00 %&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For age, almost all of the credible interval (HDI) is outside the ROPE range, which means that this coefficient is highly significant.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rope(post$chas1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Proportion of samples inside the ROPE [-0.10, 0.10]:
## 
## inside ROPE
## -----------
## 0.00 %&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rope(post$`(Intercept)`)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Proportion of samples inside the ROPE [-0.10, 0.10]:
## 
## inside ROPE
## -----------
## 0.00 %&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The same is true for the &lt;strong&gt;chas&lt;/strong&gt; variable and the &lt;strong&gt;intercept&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rope(post$dis)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Proportion of samples inside the ROPE [-0.10, 0.10]:
## 
## inside ROPE
## -----------
## 20.02 %&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In contrast, about a fifth of the credible interval of the &lt;strong&gt;dis&lt;/strong&gt; variable is inside the ROPE interval; in other words, the probability that this coefficient is practically zero is 20.02%. Note that &lt;code&gt;rope&lt;/code&gt; applied to raw draws uses the default range [-0.10, 0.10], whereas the model-based ROPE range used by &lt;code&gt;describe_posterior&lt;/code&gt; can be obtained with &lt;code&gt;rope_range&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rope_range(model_bayes)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] -0.9197104  0.9197104&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;pd-and-p-value&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; PD and P-value&lt;/h1&gt;
&lt;p&gt;Sometimes we are only interested in checking the direction of the coefficient (positive or negative). This is the role of the &lt;strong&gt;pd&lt;/strong&gt; statistic in the table above: a high value means that the associated effect is concentrated on the same side as the median. For our model, since the pd’s are equal to 1, almost all the posterior draws of the two variables &lt;strong&gt;age&lt;/strong&gt; and &lt;strong&gt;chas1&lt;/strong&gt; and of the intercept are on the same side (if the median is negative, almost all the other values are negative). It should be noted, however, that this statistic does not assess the significance of the effect.
More importantly, there exists a strong relation between this probability and the p-value, approximated as follows: the one-sided p-value &lt;span class=&#34;math inline&#34;&gt;\(\approx 1-pd\)&lt;/span&gt;, and the usual two-sided p-value &lt;span class=&#34;math inline&#34;&gt;\(\approx 2\times(1-pd)\)&lt;/span&gt;. Let’s check this with our variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df1 &amp;lt;- dplyr::select(tidy(model_freq), c(term, p.value))
df1$p.value &amp;lt;- round(df1$p.value, digits = 3)
df2 &amp;lt;- 1 - purrr::map_dbl(post, p_direction)
df &amp;lt;- cbind(df1, df2)
df&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                    term p.value     df2
## (Intercept) (Intercept)   0.000 0.00000
## age                 age   0.000 0.00000
## dis                 dis   0.354 0.18075
## chas1             chas1   0.000 0.00000&lt;/code&gt;&lt;/pre&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Over the last decade more practitioners, especially in fields such as medicine and psychology, have been turning towards Bayesian analysis, since almost everything can be interpreted straightforwardly in a probabilistic manner. However, Bayesian analysis also has some drawbacks, such as the subjective way of defining the priors (which play an important role in computing the posterior); moreover, for problems that do not have a conjugate prior, the MCMC algorithm does not always converge easily to the right values (especially with complex data).&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, 2012, page 589&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Predicting images using Convolutional neural network</title>
      <link>https://modelingwithr.rbind.io/courses/cnn_imag/cnn_imag/</link>
      <pubDate>Sat, 25 Apr 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/courses/cnn_imag/cnn_imag/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#training-the-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Training the model:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-evaluation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Model Evaluation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#prediction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Prediction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;In this article we will make use of the convolutional neural network, the most widely used deep learning method for image classification, object detection, etc.&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; For more detail about how it works, please click &lt;a href=&#34;https://docs.google.com/presentation/d/1f7yAMxElPorSAdy3iiIBWw6Py20uiu2_xK4lmeB9Dpk/edit?usp=sharing&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We are going to learn how to build and train a &lt;strong&gt;convolutional neural network&lt;/strong&gt; model using a small sample of images collected from Google search. The data includes 30 images, each showing one of three types of animals: &lt;strong&gt;cat&lt;/strong&gt;, &lt;strong&gt;dog&lt;/strong&gt;, or &lt;strong&gt;lion&lt;/strong&gt;, with an equal number of images per class, that is 10.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;First, we load the packages needed throughout this article and read the data into two objects: one called &lt;strong&gt;train&lt;/strong&gt;, which contains 7 instances of each animal type and is used for training the model, and another called &lt;strong&gt;test&lt;/strong&gt;, which contains the remaining instances for evaluating model performance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh &amp;lt;- suppressPackageStartupMessages
ssh(library(EBImage))
ssh(library(keras))
ssh(library(foreach))

mytrain &amp;lt;- c(paste0(&amp;quot;../images/cat&amp;quot;,1:7,&amp;quot;.jpg&amp;quot;),paste0(&amp;quot;../images/dog&amp;quot;,1:7,&amp;quot;.jpg&amp;quot;),
        paste0(&amp;quot;../images/lion&amp;quot;,1:7,&amp;quot;.jpg&amp;quot;))

mytest &amp;lt;- c(paste0(&amp;quot;../images/cat&amp;quot;,8:10,&amp;quot;.jpg&amp;quot;),paste0(&amp;quot;../images/dog&amp;quot;,8:10,&amp;quot;.jpg&amp;quot;),
        paste0(&amp;quot;../images/lion&amp;quot;,8:10,&amp;quot;.jpg&amp;quot;))

train &amp;lt;- lapply(mytrain, readImage)
test &amp;lt;- lapply(mytest, readImage)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let us figure out what information each image contains.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train[[1]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Image 
##   colorMode    : Color 
##   storage.mode : double 
##   dim          : 275 183 3 
##   frames.total : 3 
##   frames.render: 1 
## 
## imageData(object)[1:5,1:6,1]
##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
## [1,] 0.2039216 0.2039216 0.2039216 0.2078431 0.2000000 0.2000000
## [2,] 0.2039216 0.2039216 0.2078431 0.2078431 0.2000000 0.2039216
## [3,] 0.2078431 0.2078431 0.2078431 0.2117647 0.2039216 0.2078431
## [4,] 0.2117647 0.2117647 0.2156863 0.2156863 0.2117647 0.2117647
## [5,] 0.2156863 0.2156863 0.2196078 0.2196078 0.2156863 0.2156863&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, this is a color image of 275 by 183 pixels with 3 channels (RGB).&lt;/p&gt;
&lt;p&gt;We can visualize an image as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(test[[4]])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/cnn_imag/2020-04-25-cnn-imag_files/figure-html/unnamed-chunk-4-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If instead we want to visualize all the images as one block, we can make use of the &lt;strong&gt;foreach&lt;/strong&gt; package to apply a for loop as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mfrow=c(7,3))
foreach(i=1:21) %do% {plot(train[[i]])}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/cnn_imag/2020-04-25-cnn-imag_files/figure-html/unnamed-chunk-5-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mfrow=c(1,1))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Having taken a brief glance at our data, we see that the image sizes differ from one another, which is not what our image classification model expects. The following script therefore resizes all the images to the same size, &lt;strong&gt;150x150x3&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;foreach(i=1:21) %do% {train[[i]] &amp;lt;- resize(train[[i]],150,150)}
foreach(i=1:9) %do% {test[[i]] &amp;lt;- resize(test[[i]],150,150)}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To check the result we use the following:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;str(test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## List of 9
##  $ :Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   .. ..@ .Data    : num [1:150, 1:150, 1:3] 0.761 0.78 0.773 0.755 0.768 ...
##   .. ..@ colormode: int 2
##  $ :Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   .. ..@ .Data    : num [1:150, 1:150, 1:3] 0.441 0.465 0.496 0.528 0.524 ...
##   .. ..@ colormode: int 2
##  $ :Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   .. ..@ .Data    : num [1:150, 1:150, 1:3] 0.986 0.992 0.951 0.945 0.929 ...
##   .. ..@ colormode: int 2
##  $ :Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   .. ..@ .Data    : num [1:150, 1:150, 1:3] 0.81 0.751 0.787 0.825 0.508 ...
##   .. ..@ colormode: int 2
##  $ :Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   .. ..@ .Data    : num [1:150, 1:150, 1:3] 0.357 0.5 0.49 0.626 0.522 ...
##   .. ..@ colormode: int 2
##  $ :Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   .. ..@ .Data    : num [1:150, 1:150, 1:3] 0.375 0.365 0.375 0.397 0.393 ...
##   .. ..@ colormode: int 2
##  $ :Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   .. ..@ .Data    : num [1:150, 1:150, 1:3] 0.651 0.57 0.614 0.634 0.63 ...
##   .. ..@ colormode: int 2
##  $ :Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   .. ..@ .Data    : num [1:150, 1:150, 1:3] 0.268 0.201 0.198 0.213 0.182 ...
##   .. ..@ colormode: int 2
##  $ :Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   .. ..@ .Data    : num [1:150, 1:150, 1:3] 0 0 0 0 0 ...
##   .. ..@ colormode: int 2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, all the images now have the same size, each stored as a 3-dimensional array. The next step is to combine all the images into one block.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trainall &amp;lt;- combine(train)
testall &amp;lt;- combine(test) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can display the output block using the following:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;display(tile(trainall,7))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/cnn_imag/2020-04-25-cnn-imag_files/figure-html/unnamed-chunk-9-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now the images are nicely combined into one block with four dimensions: the number of instances (images), height, width, and the number of channels; this block is the input that will be used in our model. However, to read the input correctly, the model expects the first dimension to be the number of instances, the second the height, the third the width, and the fourth the number of channels.
Let us check whether the input has the correct order.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;str(trainall)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Formal class &amp;#39;Image&amp;#39; [package &amp;quot;EBImage&amp;quot;] with 2 slots
##   ..@ .Data    : num [1:150, 1:150, 1:3, 1:21] 0.204 0.209 0.216 0.223 0.233 ...
##   ..@ colormode: int 2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This order is not correct since the number of instances is in the last position, so we reorder the positions as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trainall &amp;lt;- aperm(trainall, c(4,1,2,3))
testall &amp;lt;- aperm(testall, c(4,1,2,3))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The last thing to do, before defining the architecture of our model, is to create a variable holding the image labels and then convert it to dummy (one-hot) variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trainlabels &amp;lt;- rep(0:2, each=7)
testlabels &amp;lt;- rep(0:2, each=3)
trainy &amp;lt;- to_categorical(trainlabels)
testy &amp;lt;- to_categorical(testlabels)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;training-the-model&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Training the model:&lt;/h1&gt;
&lt;p&gt;The architecture of our model will contain the following layers:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;A convolution layer with 32 filters of size 3x3 (since the input is 150x150x3, the third dimension of each filter will be 3, i.e. 3x3x3), with &lt;strong&gt;ReLU&lt;/strong&gt; as the activation function.&lt;/li&gt;
&lt;li&gt;A max pooling layer of 3x3 with strides=2.&lt;/li&gt;
&lt;li&gt;A convolution layer with 64 filters of size 5x5, with the &lt;strong&gt;ReLU&lt;/strong&gt; function.&lt;/li&gt;
&lt;li&gt;A max pooling layer of 2x2 with strides=2.&lt;/li&gt;
&lt;li&gt;A convolution layer with 128 filters of size 3x3, with the &lt;strong&gt;ReLU&lt;/strong&gt; function.&lt;/li&gt;
&lt;li&gt;A max pooling layer of 2x2 with strides=2.&lt;/li&gt;
&lt;li&gt;A flatten layer to collapse all the output elements into one long vector, so that it can be connected to a traditional neural network with fully connected layers.&lt;/li&gt;
&lt;li&gt;A dense layer of 256 nodes with the &lt;strong&gt;leaky ReLU&lt;/strong&gt; function; the slope for the negative part will be &lt;strong&gt;0.1&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;A dropout layer with a rate of 40%, which acts as a regularization method by randomly ignoring 40% of the nodes in each epoch (iteration).&lt;/li&gt;
&lt;li&gt;The output layer with 3 nodes, since we have 3 classes, with the &lt;strong&gt;softmax&lt;/strong&gt; function.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In &lt;strong&gt;keras&lt;/strong&gt; package the above steps will be coded as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model &amp;lt;- keras_model_sequential()

model %&amp;gt;% 
  layer_conv_2d(filters = 32,
                        kernel_size = c(3,3),
                        activation = &amp;quot;relu&amp;quot;,
                        input_shape = c(150,150,3))%&amp;gt;%
  layer_max_pooling_2d(pool_size = c(3,3), strides = 2)%&amp;gt;%
  layer_conv_2d(filters = 64,
               kernel_size = c(5,5),
                activation = &amp;quot;relu&amp;quot;) %&amp;gt;%
  layer_max_pooling_2d(pool_size = c(2,2), strides = 2)%&amp;gt;%
  layer_conv_2d(filters = 128,
                kernel_size = c(3,3),
                activation = &amp;quot;relu&amp;quot;) %&amp;gt;%
  layer_max_pooling_2d(pool_size = c(2,2), strides = 2)%&amp;gt;%
  layer_flatten()%&amp;gt;%
  layer_dense(units=256)%&amp;gt;% layer_activation_leaky_relu(alpha = 0.1)%&amp;gt;%
  layer_dropout(rate=0.4)%&amp;gt;%
  layer_dense(units=3, activation = &amp;quot;softmax&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can inspect this architecture and the number of parameters it has by calling the summary function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(model)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Model: &amp;quot;sequential&amp;quot;
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## conv2d (Conv2D)                     (None, 148, 148, 32)            896         
## ________________________________________________________________________________
## max_pooling2d (MaxPooling2D)        (None, 73, 73, 32)              0           
## ________________________________________________________________________________
## conv2d_1 (Conv2D)                   (None, 69, 69, 64)              51264       
## ________________________________________________________________________________
## max_pooling2d_1 (MaxPooling2D)      (None, 34, 34, 64)              0           
## ________________________________________________________________________________
## conv2d_2 (Conv2D)                   (None, 32, 32, 128)             73856       
## ________________________________________________________________________________
## max_pooling2d_2 (MaxPooling2D)      (None, 16, 16, 128)             0           
## ________________________________________________________________________________
## flatten (Flatten)                   (None, 32768)                   0           
## ________________________________________________________________________________
## dense (Dense)                       (None, 256)                     8388864     
## ________________________________________________________________________________
## leaky_re_lu (LeakyReLU)             (None, 256)                     0           
## ________________________________________________________________________________
## dropout (Dropout)                   (None, 256)                     0           
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 3)                       771         
## ================================================================================
## Total params: 8,515,651
## Trainable params: 8,515,651
## Non-trainable params: 0
## ________________________________________________________________________________&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, the model has a huge number of parameters: &lt;strong&gt;8,515,651&lt;/strong&gt;. Since the data has only 21 instances, training on my laptop takes only a few seconds; with a large dataset, however, this model may take much longer.&lt;/p&gt;
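The parameter counts in the summary can be verified by hand: each convolution layer has (filter height x filter width x input channels) weights per filter plus one bias per filter, and each dense layer has (inputs x units) weights plus one bias per unit:

```r
# Recompute the parameter counts reported by summary(model).
conv1  <- (3 * 3 * 3)     * 32  + 32   # 896
conv2  <- (5 * 5 * 32)    * 64  + 64   # 51264
conv3  <- (3 * 3 * 64)    * 128 + 128  # 73856
dense1 <- (16 * 16 * 128) * 256 + 256  # 8388864 (flattened 16x16x128 = 32768 inputs)
dense2 <- 256 * 3 + 3                  # 771
conv1 + conv2 + conv3 + dense1 + dense2  # 8515651, matching the summary
```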
&lt;p&gt;The last step before running the model is to specify the loss function, the optimizer and the metric.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For multiclassification problem the most widely used one is &lt;a href=&#34;https://towardsdatascience.com/common-loss-functions-in-machine-learning-46af0ffc4d23&#34;&gt;categorical cross entropy&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Besides the popular &lt;strong&gt;gradient descent&lt;/strong&gt; &lt;a href=&#34;https://keras.io/optimizers/&#34;&gt;optimizer&lt;/a&gt; (with its variants, &lt;strong&gt;stochastic gradient descent&lt;/strong&gt; and &lt;strong&gt;mini-batch gradient descent&lt;/strong&gt;), there exist others such as &lt;strong&gt;adam&lt;/strong&gt;, &lt;strong&gt;adadelta&lt;/strong&gt; and &lt;strong&gt;rmsprop&lt;/strong&gt; (the first one will be used in our case). In practice we sometimes tune the hyperparameters by switching between these optimizers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For classification problems we have many &lt;a href=&#34;https://keras.io/metrics/&#34;&gt;metrics&lt;/a&gt;, the famous ones are: &lt;strong&gt;accuracy&lt;/strong&gt; (used for our case), &lt;strong&gt;roc&lt;/strong&gt;, &lt;strong&gt;area under roc&lt;/strong&gt;, &lt;strong&gt;precision&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model %&amp;gt;% compile(loss= &amp;quot;categorical_crossentropy&amp;quot;,
                  optimizer=&amp;quot;adam&amp;quot;,
                  metrics=&amp;quot;accuracy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At this stage everything is ready to train our model by calling the &lt;strong&gt;fit&lt;/strong&gt; function. The epoch value is the number of iterations, or gradient descent passes over the data, and &lt;strong&gt;validation_split&lt;/strong&gt; is the fraction of samples held out for assessment, here four images. I have run this model before, and in order to avoid running it again I have commented the script out with #; if you want to run it, just uncomment it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#history &amp;lt;- model %&amp;gt;%
  #fit(trainall, trainy, epoch=50, validation_split=0.2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unlike classical machine learning models, where we can set a seed to make the results reproducible, here each time we rerun the model we get a different result. In practice, we intentionally rerun the model many times to improve its performance, and once we get the best one we save it as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#save_model_hdf5(model, &amp;quot;modelcnn.h5&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And we can load it again as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model &amp;lt;- load_model_hdf5(&amp;quot;modelcnn.h5&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The history object holds all the necessary information, such as the metric values for each epoch, so we can extract this information to create a plot as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#train_loss &amp;lt;- history$metrics$loss
#valid_loss &amp;lt;- history$metrics$val_loss
#train_acc &amp;lt;- history$metrics$accuracy
#valid_acc &amp;lt;- history$metrics$val_accuracy
#epoch &amp;lt;- 1:50&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#df &amp;lt;- tibble::tibble(epoch,train_loss,valid_loss,train_acc,valid_acc)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;ggplot2&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#p1 &amp;lt;- ggplot(df,aes(x=epoch, train_loss))+
 # geom_point(size=1, color=&amp;quot;blue&amp;quot;)+
#  geom_point(aes(x=epoch, valid_loss), size=1, color=&amp;quot;red&amp;quot;)+
 # ylab(&amp;quot;Loss&amp;quot;)
#ggsave(&amp;quot;plot_loss.jpg&amp;quot;, p1, device = &amp;quot;jpeg&amp;quot;, width = 20, height = 15, units = &amp;quot;cm&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you may notice, the code above has not been executed, to avoid the reproducibility issue discussed earlier and to keep things simple. Here we load the saved plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mar=c(0,0,0,0))
plot(as.raster(readImage(&amp;quot;plot_loss.jpg&amp;quot;)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/cnn_imag/2020-04-25-cnn-imag_files/figure-html/unnamed-chunk-22-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This plot shows the loss values for both the training set (in blue) and the validation set (in red). We see that the training loss consistently decreases, whereas the validation loss oscillates widely, reflecting the model’s limited ability to predict new, unseen examples.
The same conclusion can be drawn from the following plot of the accuracy metric.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#p2 &amp;lt;- ggplot(df,aes(x=epoch, train_acc))+
  #geom_point(size=1, color=&amp;quot;blue&amp;quot;)+
  #geom_point(aes(x=epoch, valid_acc), size=1, color=&amp;quot;red&amp;quot;)+
 # ylab(&amp;quot;accuracy&amp;quot;)
#ggsave(&amp;quot;plot_acc.jpg&amp;quot;, p2, device = &amp;quot;jpeg&amp;quot;, width = 20, height = 15, units = &amp;quot;cm&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here again we do the same thing:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mar=c(0,0,0,0))
plot(as.raster(readImage(&amp;quot;plot_acc.jpg&amp;quot;)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/cnn_imag/2020-04-25-cnn-imag_files/figure-html/unnamed-chunk-24-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: we could have used the plot function directly, plot(history), but doing so we would get a different plot each time we knit the document.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;model-evaluation&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Model Evaluation&lt;/h1&gt;
&lt;p&gt;We can evaluate the model performance using the training set as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_evaluate&amp;lt;- evaluate(model, trainall, trainy)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this first architecture we get a high accuracy rate of &lt;strong&gt;95.24%&lt;/strong&gt; and a loss of &lt;strong&gt;0.0832&lt;/strong&gt;. However, you should be cautious when this rate is very high, since it is computed from the training data and in many cases reflects the &lt;strong&gt;overfitting&lt;/strong&gt; problem&lt;a href=&#34;#fn2&#34; class=&#34;footnote-ref&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;. The better evaluation is therefore the one based on the testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test_evaluate&amp;lt;- evaluate(model, testall, testy)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using the testing set, which the model has not seen, the accuracy rate drops to about 55.56%.
This is exactly what we warned about: we have an overfitting problem, where the model tries to memorize every noisy pattern, which constrains it to generalize poorly to unseen data.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;prediction&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Prediction&lt;/h1&gt;
&lt;p&gt;We can get the predictions of the testing set as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict_classes(model,testall)
pred&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0 0 2 0 2 0 2 2 2&lt;/code&gt;&lt;/pre&gt;
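These predicted classes can be cross-tabulated against the true labels to see exactly where the errors occur; since the predictions are printed above, we can reproduce the test accuracy directly:

```r
# Cross-tabulate the predicted classes printed above against the true labels.
pred <- c(0, 0, 2, 0, 2, 0, 2, 2, 2)  # predict_classes output shown above
testlabels <- rep(0:2, each = 3)      # true classes: cat = 0, dog = 1, lion = 2
table(predicted = pred, actual = testlabels)
mean(pred == testlabels)              # 5/9, i.e. the 55.56% reported earlier
```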
&lt;p&gt;The following plot shows which images from the testing set are correctly classified and which are not:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred[pred==0] &amp;lt;- &amp;quot;cat&amp;quot;
pred[pred==1] &amp;lt;- &amp;quot;dog&amp;quot;
pred[pred==2] &amp;lt;- &amp;quot;lion&amp;quot;


par(mfrow=c(3,3))


foreach(i=1:9) %do% {display(test[[i]], method=&amp;quot;raster&amp;quot;);
  text(x = 20, y = 20, label = pred[i], 
       adj = c(0,1), col = &amp;quot;black&amp;quot;, cex = 4)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/cnn_imag/2020-04-25-cnn-imag_files/figure-html/unnamed-chunk-28-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mfrow=c(1,1))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using this model to predict the test examples, all the dogs are misclassified whereas the lions are perfectly classified.&lt;/p&gt;
&lt;p&gt;We can also display the training examples as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred1 &amp;lt;- predict_classes(model,trainall)

pred1[pred1==0] &amp;lt;- &amp;quot;cat&amp;quot;
pred1[pred1==1] &amp;lt;- &amp;quot;dog&amp;quot;
pred1[pred1==2] &amp;lt;- &amp;quot;lion&amp;quot;


par(mfrow=c(7,3))


foreach(i=1:21) %do% {display(train[[i]], method=&amp;quot;raster&amp;quot;);
  text(x = 20, y = 20, label = pred1[i], 
       adj = c(0,1), col = &amp;quot;black&amp;quot;, cex = 2)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/courses/cnn_imag/2020-04-25-cnn-imag_files/figure-html/unnamed-chunk-29-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mfrow=c(1,1))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Conclusion&lt;/h1&gt;
&lt;p&gt;As we have seen, this model perfectly identified the lions but failed to identify any of the dogs in the testing set, which is not the case for the training data, where the model achieves high accuracy; as mentioned earlier, this is a consequence of the overfitting problem. However, there are a bunch of techniques that can be used in such situations, such as regularization methods (L2, L1), pooling, dropout layers, etc. All these techniques will be addressed in further articles.
Besides tackling overfitting, we can also improve the model by playing around with hyperparameters, such as the number of layers, the number of nodes in each layer, the number of epochs, etc.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Be aware that this model cannot be considered reliable, since it was trained on a very small dataset. However, we may get higher performance from this model if we train it on a very large dataset.&lt;/p&gt;
&lt;h1&gt;Session information&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## R version 4.0.1 (2020-06-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.3.2  foreach_1.5.0  keras_2.3.0.0  EBImage_4.30.0
## 
## loaded via a namespace (and not attached):
##  [1] reticulate_1.16     locfit_1.5-9.4      tidyselect_1.1.0   
##  [4] xfun_0.18           purrr_0.3.4         lattice_0.20-41    
##  [7] colorspace_1.4-1    generics_0.0.2      vctrs_0.3.4        
## [10] htmltools_0.5.0     yaml_2.2.1          base64enc_0.1-3    
## [13] rlang_0.4.7         pillar_1.4.6        withr_2.3.0        
## [16] glue_1.4.2          rappdirs_0.3.1      BiocGenerics_0.34.0
## [19] jpeg_0.1-8.1        lifecycle_0.2.0     tensorflow_2.2.0   
## [22] stringr_1.4.0       munsell_0.5.0       blogdown_0.20      
## [25] gtable_0.3.0        htmlwidgets_1.5.2   codetools_0.2-16   
## [28] evaluate_0.14       knitr_1.30          tfruns_1.4         
## [31] parallel_4.0.1      Rcpp_1.0.5          scales_1.1.1       
## [34] jsonlite_1.7.1      abind_1.4-5         png_0.1-7          
## [37] digest_0.6.25       stringi_1.5.3       tiff_0.1-5         
## [40] bookdown_0.20       dplyr_1.0.2         grid_4.0.1         
## [43] tools_4.0.1         bitops_1.0-6        magrittr_1.5       
## [46] RCurl_1.98-1.2      tibble_3.0.3        crayon_1.3.4       
## [49] whisker_0.4         pkgconfig_2.0.3     zeallot_0.1.0      
## [52] ellipsis_0.3.1      Matrix_1.2-18       rmarkdown_2.4      
## [55] iterators_1.0.12    R6_2.4.1            fftwtools_0.9-9    
## [58] compiler_4.0.1&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;François Chollet, Deep Learning with R, MEAP edition, 2017, p. 112 &lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;Gareth James et al., An Introduction to Statistical Learning, Springer, New York, page 33, &lt;a href=&#34;ISBN:978-1-4614-7173-0&#34; class=&#34;uri&#34;&gt;ISBN:978-1-4614-7173-0&lt;/a&gt;&lt;a href=&#34;#fnref2&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Predicting large and imbalanced data set using the R package tidymodels</title>
      <link>https://modelingwithr.rbind.io/post/scania/predicting-large-and-imbalanced-data-set-using-the-r-package-tidymodels/</link>
      <pubDate>Tue, 14 Apr 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/post/scania/predicting-large-and-imbalanced-data-set-using-the-r-package-tidymodels/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-exploration&#34;&gt;Data exploration&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#summary-of-the-variables&#34;&gt;Summary of the variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#missing-values&#34;&gt;Missing values&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#imbalanced-data&#34;&gt;imbalanced data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#building-the-recipe&#34;&gt;Building the recipe&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#building-the-workflow&#34;&gt;Building the workflow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#random-forest-model&#34;&gt;Random forest model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#model-training&#34;&gt;Model training&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-evaluation&#34;&gt;Model evaluation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-tuning&#34;&gt;Model tuning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#logistic-regression-model&#34;&gt;Logistic regression model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-information&#34;&gt;Session information&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;div id=&#34;introduction&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;The easiest way, at least for me, to deploy machine learning models is to make use of the R package &lt;strong&gt;tidymodels&lt;/strong&gt;, a collection of packages that makes the workflow steps of a project smooth, tightly connected to each other, and easily manageable in a well-structured manner.
The core packages contained in tidymodels are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;rsample: for data splitting and resampling.&lt;/li&gt;
&lt;li&gt;parsnip: unified interface to the most common machine learning models.&lt;/li&gt;
&lt;li&gt;recipes: unified interface to the most common pre-processing tools for feature engineering.&lt;/li&gt;
&lt;li&gt;workflows: bundles the workflow steps together.&lt;/li&gt;
&lt;li&gt;tune: for optimization of hyperparameters.&lt;/li&gt;
&lt;li&gt;yardstick: provides the most common performance metrics.&lt;/li&gt;
&lt;li&gt;broom: converts outputs into user-friendly formats such as tibbles.&lt;/li&gt;
&lt;li&gt;dials: provides tools for parameter grids.&lt;/li&gt;
&lt;li&gt;infer: provides tools for statistical inference.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition to the above packages, tidymodels also includes some classical packages such as dplyr, ggplot2, purrr, and tibble. For more detail click &lt;a href=&#34;https://www.tidymodels.org&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In order to explore and understand tidymodels more widely, we should look for a noisy dataset that has a large number of variables with missing values. Fortunately, I found an open source dataset that fulfils these requirements and, in addition, is highly imbalanced. This data is about &lt;strong&gt;Scania trucks&lt;/strong&gt; and can be downloaded from the &lt;a href=&#34;https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks&#34;&gt;UCI machine learning repository&lt;/a&gt;, with an extra file for its description.&lt;/p&gt;
&lt;p&gt;The target variable of this data concerns the air pressure system &lt;strong&gt;APS&lt;/strong&gt; in the truck, which generates the pressurized air that is utilized in various functions of the truck. It has two classes: positive &lt;strong&gt;pos&lt;/strong&gt; if a component failure is due to a failure in the APS system, and negative &lt;strong&gt;neg&lt;/strong&gt; if a component failure is not related to the APS system. This means that we are dealing with a binary classification problem.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-exploration&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data exploration&lt;/h2&gt;
&lt;p&gt;The data is already separated into training and testing sets at the source, so let’s load the packages that we need along with the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh &amp;lt;- suppressPackageStartupMessages
ssh(library(readr))
ssh(library(caret))
ssh(library(themis))
ssh(library(tidymodels))
train &amp;lt;- read_csv(&amp;quot;https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_training_set.csv&amp;quot;, skip = 20)
test &amp;lt;- read_csv(&amp;quot;https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_test_set.csv&amp;quot;, skip = 20)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that the data is a tibble where the first twenty rows are a mix of rows containing some text description and empty rows, while the 21st row contains the column names. That is why we have set the &lt;strong&gt;skip&lt;/strong&gt; argument equal to &lt;strong&gt;20&lt;/strong&gt;; the 21st row is then read as column names by default (&lt;code&gt;col_names = TRUE&lt;/code&gt;).&lt;/p&gt;
&lt;div id=&#34;summary-of-the-variables&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Summary of the variables&lt;/h3&gt;
&lt;p&gt;First let’s check the dimension of the two sets to be aware of what we are dealing with.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dim(train)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 60000   171&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dim(test)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 16000   171&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The training set has 60000 rows and 171 variables, which is a moderately large dataset. Inspecting this data with the usual functions such as &lt;strong&gt;summary&lt;/strong&gt; or &lt;strong&gt;str&lt;/strong&gt; would thus give heavy, not easily readable output. A better alternative is to extract, in an aggregated way, only the most important information required for building a machine learning model, for instance the variable types, some statistics about the variable values, the missing values, etc.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map_chr(train, typeof) %&amp;gt;% 
  tibble() %&amp;gt;% 
  table()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;.
character    double 
      170         1 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Strangely, all the variables but one are characters, which contradicts the description of this data in the description file. To figure out what is going on, we display a few rows and a few columns.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train[1:5,1:7]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 5 x 7
  class aa_000 ab_000 ac_000     ad_000 ae_000 af_000
  &amp;lt;chr&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;chr&amp;gt;      &amp;lt;chr&amp;gt;  &amp;lt;chr&amp;gt;  &amp;lt;chr&amp;gt; 
1 neg    76698 na     2130706438 280    0      0     
2 neg    33058 na     0          na     0      0     
3 neg    41040 na     228        100    0      0     
4 neg       12 0      70         66     0      10    
5 neg    60874 na     1368       458    0      0     &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I think the problem is that the missing values in the data, indicated by &lt;strong&gt;na&lt;/strong&gt;, are not recognized as missing values; instead they are treated as characters, and this is what makes the function &lt;strong&gt;read_csv&lt;/strong&gt; coerce every variable that contains these &lt;strong&gt;na&lt;/strong&gt; values to character type. To fix this problem we can either go back and set the &lt;strong&gt;na&lt;/strong&gt; argument to &lt;strong&gt;“na”&lt;/strong&gt;, or we can set the missing values by hand as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train[-1] &amp;lt;- train[-1] %&amp;gt;% 
  modify(~replace(., .==&amp;quot;na&amp;quot;, NA)) %&amp;gt;%
  modify(., as.double)&lt;/code&gt;&lt;/pre&gt;
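&lt;p&gt;As a side note, the first option, declaring the missing-value token at read time, would be a short sketch along these lines (using the same URL as above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# read_csv recognizes &amp;quot;na&amp;quot; as a missing value directly,
# so the numeric columns keep their double type from the start
train &amp;lt;- read_csv(&amp;quot;https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_training_set.csv&amp;quot;,
                  skip = 20, na = c(&amp;quot;na&amp;quot;, &amp;quot;NA&amp;quot;))&lt;/code&gt;&lt;/pre&gt;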
&lt;p&gt;Now let’s check again&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map_chr(train, typeof) %&amp;gt;% 
  tibble() %&amp;gt;% 
  table()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;.
character    double 
        1       170 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first column excluded above is our target variable &lt;strong&gt;class&lt;/strong&gt;. We should not forget to do the same transformation to the test set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test[-1] &amp;lt;- test[-1] %&amp;gt;% 
  modify(~replace(., .==&amp;quot;na&amp;quot;, NA)) %&amp;gt;%
  modify(., as.double)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we tried to apply the &lt;strong&gt;summary&lt;/strong&gt; function to all 170 variables, we would spend a lot of time reading the summary of each variable without much gain. Instead, we look for an automated way to get only the information needed to build our model efficiently. To decide whether we should normalize the data or not, for instance, we display the standard deviations of all the variables in decreasing order.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: with tree-based models we need neither to normalize the data nor to convert factors to dummies.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map_dbl(train[-1], sd, na.rm=TRUE) %&amp;gt;% 
  tibble(sd = .) %&amp;gt;% 
  arrange(-sd)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 170 x 1
           sd
        &amp;lt;dbl&amp;gt;
 1 794874918.
 2  97484780.
 3  42746746.
 4  40404413.
 5  40404412.
 6  40404411.
 7  11567771.
 8  10886737.
 9  10859905.
10  10859904.
# ... with 160 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We have very large variability, which means that the data should be normalized for any machine learning model that uses gradient descent or is based on distances between observations.&lt;/p&gt;
&lt;p&gt;Another thing we can check is whether some variables have a small number of unique values and can hence be converted to factor type.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map(train[-1], unique) %&amp;gt;% 
  lengths(.) %&amp;gt;% 
  sort(.) %&amp;gt;% 
  head(5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;cd_000 ch_000 as_000 ef_000 ab_000 
     2      3     22     29     30 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To keep things simple we consider only the first two as candidates for conversion to factor type.&lt;/p&gt;
&lt;p&gt;The first one is constant, a zero-variance predictor since its variance equals zero, and the second one should be converted to a factor with two levels (in both sets); but since it has many missing values we will decide about it later on. Notice that we do not apply these transformations here, because they will be combined at once with all the other required transformations, as will be shown shortly.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;missing-values&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Missing values&lt;/h3&gt;
&lt;p&gt;The best way to deal with missing values depends on their number compared to the dataset size. If we have a small number, it is easier to simply remove the affected rows from the data; if, in contrast, we have a large number, the best choice is to impute them using one of the common methods designed for this kind of issue.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dim(train[!complete.cases(train),])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 59409   171&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, almost every row contains at least one missing value in some column. Let’s check the distribution of missing values within columns.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df &amp;lt;- modify(train[-1], is.na) %&amp;gt;% 
  colSums() %&amp;gt;%
  tibble(names = colnames(train[-1]),missing_values=.) %&amp;gt;% 
  arrange(-missing_values)
  
df&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 170 x 2
   names  missing_values
   &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt;
 1 br_000          49264
 2 bq_000          48722
 3 bp_000          47740
 4 bo_000          46333
 5 ab_000          46329
 6 cr_000          46329
 7 bn_000          44009
 8 bm_000          39549
 9 bl_000          27277
10 bk_000          23034
# ... with 160 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I think the best strategy is to first remove the columns that have a large number of missing values and then impute the rest; thereby we reduce the number of predictors and the number of missing values at once. The following script keeps the predictors that have fewer than &lt;strong&gt;10000&lt;/strong&gt; missing values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;names &amp;lt;- modify(train[-1], is.na) %&amp;gt;% 
  colSums() %&amp;gt;%
  tibble(names = colnames(train[-1]), missing_values=.) %&amp;gt;% 
  filter(missing_values &amp;lt; 10000) 
train1 &amp;lt;- train[c(&amp;quot;class&amp;quot;,names$names)]
test1 &amp;lt;- test[c(&amp;quot;class&amp;quot;,names$names)]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An important thing to note here is that if we use imputation methods that rely on information from other columns and/or rows to predict the current missing value, then the data must first be split into training and testing sets before any imputation, to abide by the crucial rule of machine learning: the test data should never be seen by the model during the training process.
Fortunately, our data is already split, so the imputation can be done separately. However, the imputation methods will be implemented later on with the help of the &lt;strong&gt;recipes&lt;/strong&gt; package, where we bundle all the pre-processing steps together.
&lt;strong&gt;Note&lt;/strong&gt;: the variable &lt;strong&gt;ch_000&lt;/strong&gt; mentioned above was removed since it did not meet the required threshold.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;imbalanced-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Imbalanced data&lt;/h3&gt;
&lt;p&gt;Another important issue that we face when predicting this data is the &lt;strong&gt;imbalanced&lt;/strong&gt; problem.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prop.table(table(train1$class))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
       neg        pos 
0.98333333 0.01666667 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This data is highly imbalanced, which tends to make even the worst machine learning model give a very high accuracy rate. In other words, if we do not use any model at all and predict every observation as the largest class (in our case negative), the accuracy rate will be approximately equal to the proportion of the largest class (in our case about 98%), which is a highly misleading result. Moreover, this misleading result can be catastrophic if we are more interested in predicting the small class (in our case positive), such as when detecting fraudulent credit cards. If you would like more detail about how to deal with imbalanced data, please check this &lt;a href=&#34;https://modelingwithr.rbind.io/post/methods-to-deal-with-imbalanced-data/&#34;&gt;article&lt;/a&gt;.&lt;/p&gt;
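&lt;p&gt;To make this concrete, we can compute that no-information baseline directly: a naive rule that predicts every observation as &lt;strong&gt;neg&lt;/strong&gt; without any model already matches the majority-class proportion shown above.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# accuracy of a dummy classifier that always predicts the majority class
mean(train1$class == &amp;quot;neg&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 0.9833333&lt;/code&gt;&lt;/pre&gt;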
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;building-the-recipe&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Building the recipe&lt;/h2&gt;
&lt;p&gt;Our initial model will be the &lt;strong&gt;random forest&lt;/strong&gt;, which is one of the most popular models. The first step in building our model is to define it together with the &lt;strong&gt;engine&lt;/strong&gt;, which is the method (or the package) used to fit the model, and the &lt;strong&gt;mode&lt;/strong&gt;, with two possible values, &lt;strong&gt;classification&lt;/strong&gt; or &lt;strong&gt;regression&lt;/strong&gt;. In our case, for instance, there are two available engines: &lt;strong&gt;randomForest&lt;/strong&gt; or &lt;strong&gt;ranger&lt;/strong&gt;. Notice that it is the &lt;strong&gt;parsnip&lt;/strong&gt; package that provides these settings. For more detail about all the available models click &lt;a href=&#34;https://cran.r-project.org/web/packages/parsnip/parsnip.pdf&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: To speed up the computation process we restrict the forest to &lt;strong&gt;100&lt;/strong&gt; trees instead of the default &lt;strong&gt;500&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf &amp;lt;- rand_forest(trees = 100) %&amp;gt;% 
  set_engine(&amp;quot;ranger&amp;quot;, num.threads=3, seed = 123) %&amp;gt;%
  set_mode(&amp;quot;classification&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Most machine learning models require pre-processed data with some feature engineering. Traditionally, base R (along with packages such as dplyr and stringr) provides a wide range of functions for almost every kind of feature engineering. However, if we have many different transformations to perform, they have to be done separately, and it becomes a little cumbersome to repeat the same scripts, for the testing set for instance. The &lt;strong&gt;recipes&lt;/strong&gt; package therefore provides an easy way to combine all the transformations, together with other model-related features (such as selecting the predictors to include, identifiers, etc.), into a single block that can then be applied to any other subset of the data.&lt;/p&gt;
&lt;p&gt;For our case we will apply the following transformations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Imputing the missing values with the median of the corresponding variable, since we have only numeric variables (for simplicity).&lt;/li&gt;
&lt;li&gt;Removing variables that have zero variance (variables with a single unique value).&lt;/li&gt;
&lt;li&gt;Removing highly correlated predictors, using the threshold &lt;strong&gt;0.8&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Normalizing the data (even though we do not need it for this model, we add this step since the recipe will be reused with other models that rely on gradient descent or distance calculations).&lt;/li&gt;
&lt;li&gt;Using the subsampling method &lt;strong&gt;smote&lt;/strong&gt; to create balanced data.
Notice that the &lt;strong&gt;smote&lt;/strong&gt; method is provided by the package &lt;strong&gt;themis&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To combine all these operations together we call the function &lt;strong&gt;recipe&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data_rec &amp;lt;- recipe(class~., data=train1) %&amp;gt;% 
  step_medianimpute(all_predictors() , seed_val = 111) %&amp;gt;% 
  step_zv(all_predictors()) %&amp;gt;% 
  step_corr(all_predictors(), threshold = 0.8) %&amp;gt;% 
  step_normalize(all_predictors()) %&amp;gt;%
  step_smote(class) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, everything is combined nicely and elegantly. However, this recipe has not transformed anything yet; it has only recorded the formula, the predictors, and the transformations that should be applied. This means that we can still update the formula or add and remove steps at any time before fitting our model. The really interesting feature of a recipe is that we can apply it to any other data (than that mentioned above, train) provided it has the same variable names. If you want to apply these transformations to the training data, use the &lt;strong&gt;prep&lt;/strong&gt; function and retrieve the results with the function &lt;strong&gt;juice&lt;/strong&gt;; for other data, use &lt;strong&gt;bake&lt;/strong&gt; after &lt;strong&gt;prep&lt;/strong&gt; so that some parameters are taken from the training data. For instance, when we normalize the data, this lets us use the means of the predictors computed from the training data rather than from the testing data. In our case, however, we will keep everything bundled together until the model fitting step.&lt;br /&gt;
For more detail about all the available steps click &lt;a href=&#34;https://cran.r-project.org/web/packages/recipes/recipes.pdf&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;building-the-workflow&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Building the workflow&lt;/h2&gt;
&lt;p&gt;To organize our workflow in a structured and smooth way, we use the &lt;strong&gt;workflows&lt;/strong&gt; package, which is part of the tidymodels collection.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(rf) %&amp;gt;% 
  add_recipe(data_rec)
rf_wf&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;== Workflow =======================
Preprocessor: Recipe
Model: rand_forest()

-- Preprocessor -------------------
5 Recipe Steps

* step_medianimpute()
* step_zv()
* step_corr()
* step_normalize()
* step_smote()

-- Model --------------------------
Random Forest Model Specification (classification)

Main Arguments:
  trees = 100

Engine-Specific Arguments:
  num.threads = 3
  seed = 123

Computational engine: ranger &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;random-forest-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Random forest model&lt;/h2&gt;
&lt;p&gt;Now we can run everything at once, the recipe and the model. Notice that here we can still update, add, or remove some elements before going ahead and fitting the model.&lt;/p&gt;
&lt;div id=&#34;model-training&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Model training&lt;/h3&gt;
&lt;p&gt;Everything is now ready to run our model with the default values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_rf &amp;lt;- rf_wf %&amp;gt;% 
  fit(data = train1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can extract the summary of this model as follows&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_rf %&amp;gt;% pull_workflow_fit()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;parsnip model object

Fit time:  55.7s 
Ranger result

Call:
 ranger::ranger(formula = ..y ~ ., data = data, num.trees = ~100,      num.threads = ~3, seed = ~123, verbose = FALSE, probability = TRUE) 

Type:                             Probability estimation 
Number of trees:                  100 
Sample size:                      118000 
Number of independent variables:  95 
Mtry:                             9 
Target node size:                 10 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error (Brier s.):  0.003998112 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This model has grown 100 trees and randomly chosen 9 predictors at each split. With these settings we obtain a very low OOB error rate of about 0.4% (an accuracy rate of about 99.6%). However, be cautious with such a high accuracy rate since, in practice, this result may be strongly related to an overfitting problem. The last thing I want to mention about this output is that the sample size (118000, almost twice the original 60000 rows) shows that we now have balanced data.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;model-evaluation&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Model evaluation&lt;/h3&gt;
&lt;p&gt;The best way to evaluate our model is by using the testing set. Notice that &lt;strong&gt;yardstick&lt;/strong&gt; provides a bunch of metrics to use, but let’s start with the most popular one for classification problems, &lt;strong&gt;accuracy&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_rf %&amp;gt;% 
  predict( new_data = test1) %&amp;gt;% 
  bind_cols(test1[&amp;quot;class&amp;quot;]) %&amp;gt;% 
  accuracy(truth= as.factor(class), .pred_class) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
  .metric  .estimator .estimate
  &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
1 accuracy binary         0.990&lt;/code&gt;&lt;/pre&gt;
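&lt;p&gt;Since accuracy alone can be deceiving here, yardstick can also report class-wise metrics; a sketch using &lt;strong&gt;metric_set&lt;/strong&gt; to combine accuracy, sensitivity, and specificity in one call (by default yardstick treats the first factor level, here neg, as the event of interest, just as caret does):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# bind predictions to the true classes, then compute several metrics at once
preds &amp;lt;- model_rf %&amp;gt;% 
  predict(new_data = test1) %&amp;gt;% 
  bind_cols(test1[&amp;quot;class&amp;quot;]) %&amp;gt;% 
  mutate(class = as.factor(class))
multi_metric &amp;lt;- metric_set(accuracy, sens, spec)
multi_metric(preds, truth = class, estimate = .pred_class)&lt;/code&gt;&lt;/pre&gt;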
&lt;p&gt;With this model we get a high accuracy rate that is very close to the previous one. However, we should not forget that we are dealing with imbalanced data, and even though we have used a subsampling method (the smote method here), such methods do not completely solve the issue; they can only reduce it to a certain level, which is the reason why so many of them exist. It is therefore better to use the confusion matrix from the &lt;strong&gt;caret&lt;/strong&gt; package, since it gives more information.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;caret::confusionMatrix(as.factor(test1$class), predict(model_rf, new_data = test1)$.pred_class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix and Statistics

          Reference
Prediction   neg   pos
       neg 15532    93
       pos    64   311
                                          
               Accuracy : 0.9902          
                 95% CI : (0.9885, 0.9917)
    No Information Rate : 0.9748          
    P-Value [Acc &amp;gt; NIR] : &amp;lt; 2e-16         
                                          
                  Kappa : 0.7934          
                                          
 Mcnemar&amp;#39;s Test P-Value : 0.02544         
                                          
            Sensitivity : 0.9959          
            Specificity : 0.7698          
         Pos Pred Value : 0.9940          
         Neg Pred Value : 0.8293          
             Prevalence : 0.9748          
         Detection Rate : 0.9708          
   Detection Prevalence : 0.9766          
      Balanced Accuracy : 0.8828          
                                          
       &amp;#39;Positive&amp;#39; Class : neg             
                                          &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As mentioned above, the specificity rate related to the minority class, about &lt;strong&gt;77%&lt;/strong&gt;, is very low compared to the sensitivity of the majority class, about &lt;strong&gt;99.6%&lt;/strong&gt;, and you can think of this as a partial overfitting towards the majority class. So if we are more interested in the minority class (which is often the case), then we have to go back and try tuning our model, or try another subsampling method.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;model-tuning&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Model tuning&lt;/h3&gt;
&lt;p&gt;For model tuning we try other values for some arguments rather than the default values, and leave the tuning of some others to the &lt;strong&gt;dials&lt;/strong&gt; package. So let’s fix the following argument values:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;num.trees = 100. The default is 500.&lt;/li&gt;
&lt;li&gt;num.threads = 3. The default is 1.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And tune the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;mtry = tune(). The default is the square root of the number of variables.&lt;/li&gt;
&lt;li&gt;min_n = tune(). The default is 1.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;First, we define the model with these new arguments.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_tune &amp;lt;- rand_forest(trees= 100, mtry=tune(), min_n = tune()) %&amp;gt;%
  set_engine(&amp;quot;ranger&amp;quot;, num.threads=3, seed=123) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since in a grid search the two arguments mtry and min_n are data dependent, we should at least specify their ranges.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid &amp;lt;- grid_regular(mtry(range = c(9,15)), min_n(range = c(5,40)), levels = 3)
grid&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 9 x 2
   mtry min_n
  &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;
1     9     5
2    12     5
3    15     5
4     9    22
5    12    22
6    15    22
7     9    40
8    12    40
9    15    40&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By setting the levels equal to 3 we get 9 combinations, and hence 9 models will be trained.
The above recipe has steps that should not be repeated many times during tuning; we therefore apply the recipe to the training data once to get the transformed data, and we do not forget to apply it to the testing data as well.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train2 &amp;lt;- prep(data_rec) %&amp;gt;% 
  juice()
test2 &amp;lt;- prep(data_rec) %&amp;gt;% 
  bake(test1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To tune our model we use the cross-validation technique. Since we have a large data set, we use only 3 folds.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(111)
fold &amp;lt;- vfold_cv(train2, v = 3, strata = class)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we bundle the specified model with a simple formula (instead of the recipe, since the data is already transformed).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tune_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(model_tune) %&amp;gt;%
  add_formula(class~.)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To fit these models across the folds we use the &lt;strong&gt;tune_grid&lt;/strong&gt; function instead of &lt;strong&gt;fit&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tune_rf &amp;lt;- tune_wf %&amp;gt;% 
  tune_grid(resamples = fold, grid = grid)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For classification problems this function uses two metrics by default: accuracy and the area under the ROC curve. So we can extract the metric values as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results &amp;lt;- tune_rf %&amp;gt;% collect_metrics()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To get the best model we have to choose one of the two metrics, so let’s go ahead with the accuracy rate.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;best_param &amp;lt;- 
  tune_rf %&amp;gt;% select_best(metric = &amp;quot;accuracy&amp;quot;)
best_param&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 3
   mtry min_n .config
  &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;  
1    15     5 Model3 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can now finalize the workflow with the new parameter values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tune_wf2 &amp;lt;- tune_wf %&amp;gt;% 
  finalize_workflow(best_param)
tune_wf2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;== Workflow =======================
Preprocessor: Formula
Model: rand_forest()

-- Preprocessor -------------------
class ~ .

-- Model --------------------------
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 15
  trees = 100
  min_n = 5

Engine-Specific Arguments:
  num.threads = 3
  seed = 123

Computational engine: ranger &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we fit the model with the best parameter values to the entire training data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;best_model &amp;lt;- tune_wf2 %&amp;gt;% 
  fit(train2)
best_model&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;== Workflow [trained] =============
Preprocessor: Formula
Model: rand_forest()

-- Preprocessor -------------------
class ~ .

-- Model --------------------------
Ranger result

Call:
 ranger::ranger(formula = ..y ~ ., data = data, mtry = ~15L, num.trees = ~100,      min.node.size = ~5L, num.threads = ~3, seed = ~123, verbose = FALSE,      probability = TRUE) 

Type:                             Probability estimation 
Number of trees:                  100 
Sample size:                      118000 
Number of independent variables:  95 
Mtry:                             15 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error (Brier s.):  0.00359659 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s get the confusion matrix&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;caret::confusionMatrix(as.factor(test2$class), predict(best_model, new_data = test2)$.pred_class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix and Statistics

          Reference
Prediction   neg   pos
       neg 15538    87
       pos    67   308
                                          
               Accuracy : 0.9904          
                 95% CI : (0.9887, 0.9918)
    No Information Rate : 0.9753          
    P-Value [Acc &amp;gt; NIR] : &amp;lt;2e-16          
                                          
                  Kappa : 0.7951          
                                          
 Mcnemar&amp;#39;s Test P-Value : 0.1258          
                                          
            Sensitivity : 0.9957          
            Specificity : 0.7797          
         Pos Pred Value : 0.9944          
         Neg Pred Value : 0.8213          
             Prevalence : 0.9753          
         Detection Rate : 0.9711          
   Detection Prevalence : 0.9766          
      Balanced Accuracy : 0.8877          
                                          
       &amp;#39;Positive&amp;#39; Class : neg             
                                          &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, we get hardly any improvement in the specificity rate (from 0.770 to 0.780), so let’s try another subsampling method, say the &lt;strong&gt;rose&lt;/strong&gt; method.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_rose &amp;lt;- rand_forest(trees = 100, mtry=9, min_n = 5) %&amp;gt;% 
  set_engine(&amp;quot;ranger&amp;quot;, num.threads=3, seed = 123) %&amp;gt;%
  set_mode(&amp;quot;classification&amp;quot;)
data_rec2 &amp;lt;- recipe(class~., data=train1) %&amp;gt;% 
  step_medianimpute(all_predictors() , seed_val = 111) %&amp;gt;% 
  step_zv(all_predictors()) %&amp;gt;% 
  step_corr(all_predictors(), threshold = 0.8) %&amp;gt;% 
  step_normalize(all_predictors()) %&amp;gt;%
  step_rose(class) 
rf_rose_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(rf_rose) %&amp;gt;% 
  add_recipe(data_rec2)
model_rose_rf &amp;lt;- rf_rose_wf %&amp;gt;% 
  fit(data = train1)
caret::confusionMatrix(as.factor(test1$class), predict(model_rose_rf, new_data = test1)$.pred_class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix and Statistics

          Reference
Prediction   neg   pos
       neg 15522   103
       pos   140   235
                                          
               Accuracy : 0.9848          
                 95% CI : (0.9828, 0.9867)
    No Information Rate : 0.9789          
    P-Value [Acc &amp;gt; NIR] : 2.437e-08       
                                          
                  Kappa : 0.6514          
                                          
 Mcnemar&amp;#39;s Test P-Value : 0.02092         
                                          
            Sensitivity : 0.9911          
            Specificity : 0.6953          
         Pos Pred Value : 0.9934          
         Neg Pred Value : 0.6267          
             Prevalence : 0.9789          
         Detection Rate : 0.9701          
   Detection Prevalence : 0.9766          
      Balanced Accuracy : 0.8432          
                                          
       &amp;#39;Positive&amp;#39; Class : neg             
                                          &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The ROSE method performs much worse than the SMOTE method here, since the specificity rate has dropped to 69%.&lt;/p&gt;
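&lt;p&gt;The specificity reported by &lt;strong&gt;caret&lt;/strong&gt; can be recomputed directly from a cross-tabulation of the two factors; a minimal sketch, assuming the fitted model and test set from above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cm &amp;lt;- table(Prediction = as.factor(test1$class),
            Reference  = predict(model_rose_rf, new_data = test1)$.pred_class)
# with &amp;quot;neg&amp;quot; as the positive class, specificity = TN / (TN + FP)
cm[&amp;quot;pos&amp;quot;, &amp;quot;pos&amp;quot;] / sum(cm[, &amp;quot;pos&amp;quot;])&lt;/code&gt;&lt;/pre&gt;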
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;logistic-regression-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Logistic regression model&lt;/h2&gt;
&lt;p&gt;Logistic regression is another model for fitting data with a binary outcome. As before, we use the first recipe with the SMOTE method.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;logit &amp;lt;- logistic_reg() %&amp;gt;% 
  set_engine(&amp;quot;glm&amp;quot;) %&amp;gt;%
  set_mode(&amp;quot;classification&amp;quot;)

logit_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(logit) %&amp;gt;% 
  add_recipe(data_rec)

set.seed(123)
model_logit &amp;lt;- logit_wf %&amp;gt;% 
  fit(data = train1)

caret::confusionMatrix(as.factor(test1$class), predict(model_logit, new_data = test1)$.pred_class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Confusion Matrix and Statistics

          Reference
Prediction   neg   pos
       neg 15327   298
       pos    59   316
                                          
               Accuracy : 0.9777          
                 95% CI : (0.9753, 0.9799)
    No Information Rate : 0.9616          
    P-Value [Acc &amp;gt; NIR] : &amp;lt; 2.2e-16       
                                          
                  Kappa : 0.6282          
                                          
 Mcnemar&amp;#39;s Test P-Value : &amp;lt; 2.2e-16       
                                          
            Sensitivity : 0.9962          
            Specificity : 0.5147          
         Pos Pred Value : 0.9809          
         Neg Pred Value : 0.8427          
             Prevalence : 0.9616          
         Detection Rate : 0.9579          
   Detection Prevalence : 0.9766          
      Balanced Accuracy : 0.7554          
                                          
       &amp;#39;Positive&amp;#39; Class : neg             
                                          &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this model we do not get a better rate for the minority class than with the random forest model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;session-information&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Session information&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] yardstick_0.0.7  workflows_0.2.0  tune_0.1.1       tidyr_1.1.2     
 [5] tibble_3.0.3     rsample_0.0.8    purrr_0.3.4      parsnip_0.1.3   
 [9] modeldata_0.0.2  infer_0.5.3      dials_0.0.9      scales_1.1.1    
[13] broom_0.7.1      tidymodels_0.1.1 themis_0.1.2     recipes_0.1.13  
[17] dplyr_1.0.2      caret_6.0-86     ggplot2_3.3.2    lattice_0.20-41 
[21] readr_1.3.1     

loaded via a namespace (and not attached):
 [1] nlme_3.1-149         lubridate_1.7.9      doParallel_1.0.15   
 [4] DiceDesign_1.8-1     tools_4.0.1          backports_1.1.10    
 [7] utf8_1.1.4           R6_2.4.1             rpart_4.1-15        
[10] colorspace_1.4-1     nnet_7.3-14          withr_2.3.0         
[13] prettyunits_1.1.1    tidyselect_1.1.0     curl_4.3            
[16] compiler_4.0.1       parallelMap_1.5.0    cli_2.0.2           
[19] bookdown_0.20        checkmate_2.0.0      stringr_1.4.0       
[22] digest_0.6.25        rmarkdown_2.4        unbalanced_2.0      
[25] pkgconfig_2.0.3      htmltools_0.5.0      lhs_1.1.0           
[28] rlang_0.4.7          rstudioapi_0.11      BBmisc_1.11         
[31] FNN_1.1.3            generics_0.0.2       ModelMetrics_1.2.2.2
[34] magrittr_1.5         ROSE_0.0-3           Matrix_1.2-18       
[37] fansi_0.4.1          Rcpp_1.0.5           munsell_0.5.0       
[40] GPfit_1.0-8          lifecycle_0.2.0      furrr_0.1.0         
[43] stringi_1.5.3        pROC_1.16.2          yaml_2.2.1          
[46] MASS_7.3-53          plyr_1.8.6           grid_4.0.1          
[49] parallel_4.0.1       listenv_0.8.0        crayon_1.3.4        
[52] splines_4.0.1        hms_0.5.3            knitr_1.30          
[55] mlr_2.17.1           pillar_1.4.6         ranger_0.12.1       
[58] reshape2_1.4.4       codetools_0.2-16     stats4_4.0.1        
[61] fastmatch_1.1-0      glue_1.4.2           evaluate_0.14       
[64] ParamHelpers_1.14    blogdown_0.20        data.table_1.13.0   
[67] vctrs_0.3.4          foreach_1.5.0        gtable_0.3.0        
[70] RANN_2.6.1           future_1.19.1        assertthat_0.2.1    
[73] xfun_0.18            gower_0.2.2          prodlim_2019.11.13  
[76] e1071_1.7-3          class_7.3-17         survival_3.2-7      
[79] timeDate_3043.102    iterators_1.0.12     hardhat_0.1.4       
[82] lava_1.6.8           globals_0.13.0       ellipsis_0.3.1      
[85] ipred_0.9-9         &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Count data Models</title>
      <link>https://modelingwithr.rbind.io/post/count_data/count-data-models/</link>
      <pubDate>Mon, 06 Jan 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/post/count_data/count-data-models/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-partition&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Data partition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#poisson-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Poisson model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#quasi-poisson-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Quasi poisson model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#negative-binomial-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Negative binomial model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#hurdle-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Hurdle model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#hurdle-model-with-poisson-distribution.&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7.1&lt;/span&gt; hurdle model with poisson distribution.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#hurdle-model-with-negative-binomial-distribution.&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7.2&lt;/span&gt; hurdle model with negative binomial distribution.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#zero-inflated-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8&lt;/span&gt; Zero inflated model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#zero-inflated-model-with-poisson-distribution&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8.1&lt;/span&gt; Zero inflated model with poisson distribution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#zero-inflated-model-with-negative-binomial-distribution&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8.2&lt;/span&gt; Zero inflated model with negative binomial distribution&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;9&lt;/span&gt; Conclusion:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#furhter-reading&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;10&lt;/span&gt; Further reading:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-info&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;11&lt;/span&gt; Session info&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction:&lt;/h1&gt;
&lt;p&gt;When we deal with data that has a response variable of integer type, using a linear regression may violate the normality assumption, and hence all the classical statistical tests would fail to evaluate the model. However, as with logistic regression models, the generalized linear model &lt;a href=&#34;https://en.wikipedia.org/wiki/Generalized_linear_model&#34;&gt;GLM&lt;/a&gt; can be used here instead by specifying a suitable distribution.&lt;/p&gt;
&lt;p&gt;The candidate distributions for this type of data are the discrete distributions &lt;a href=&#34;https://en.wikipedia.org/wiki/Poisson_distribution&#34;&gt;poisson&lt;/a&gt; and &lt;a href=&#34;https://en.wikipedia.org/wiki/Negative_binomial_distribution&#34;&gt;negative binomial&lt;/a&gt;. The former is the best choice if the mean and the variance of the response variable are close to each other; if they are not and we persist in using this distribution, we risk an &lt;a href=&#34;https://en.wikipedia.org/wiki/Overdispersion&#34;&gt;overdispersion&lt;/a&gt; problem in the residuals. As a remedy, we can use the latter distribution, which does not have this restriction.&lt;/p&gt;
&lt;p&gt;There is another alternative, called the &lt;a href=&#34;https://en.wikipedia.org/wiki/Quasi-maximum_likelihood_estimate&#34;&gt;Quasi maximum likelihood&lt;/a&gt;, if neither the poisson distribution nor the negative binomial is suitable. The advantage of this method is that it uses only the relationship between the mean and the variance and does not require any prespecified distribution. Moreover, its estimators are approximately as efficient as the maximum likelihood estimators.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;To understand well how to model count data, we are going to use the &lt;strong&gt;Doctorvisits&lt;/strong&gt; data from the &lt;strong&gt;AER&lt;/strong&gt; package, in which the variable &lt;strong&gt;visits&lt;/strong&gt; will be our target variable, so let’s load this data along with the packages that we need throughout this article.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh &amp;lt;- suppressPackageStartupMessages
ssh(library(performance))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;performance&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh(library(ModelMetrics))
ssh(library(corrr))
ssh(library(purrr))
ssh(library(MASS))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;MASS&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh(library(tidyverse))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;ggplot2&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;tibble&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;tidyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;dplyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh(library(AER))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;car&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;lmtest&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;sandwich&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;survival&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ssh(library(broom))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Warning: package &amp;#39;broom&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(&amp;quot;DoctorVisits&amp;quot;)
doc &amp;lt;- DoctorVisits
glimpse(doc)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Rows: 5,190
Columns: 12
$ visits    &amp;lt;dbl&amp;gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, ...
$ gender    &amp;lt;fct&amp;gt; female, female, male, male, male, female, female, female,...
$ age       &amp;lt;dbl&amp;gt; 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.1...
$ income    &amp;lt;dbl&amp;gt; 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.1...
$ illness   &amp;lt;dbl&amp;gt; 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, ...
$ reduced   &amp;lt;dbl&amp;gt; 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0,...
$ health    &amp;lt;dbl&amp;gt; 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, ...
$ private   &amp;lt;fct&amp;gt; yes, yes, no, no, no, no, no, no, yes, yes, no, no, no, n...
$ freepoor  &amp;lt;fct&amp;gt; no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...
$ freerepat &amp;lt;fct&amp;gt; no, no, no, no, no, no, no, no, no, no, no, yes, no, no, ...
$ nchronic  &amp;lt;fct&amp;gt; no, no, no, no, yes, yes, no, no, no, no, no, no, yes, ye...
$ lchronic  &amp;lt;fct&amp;gt; no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This data comes from an Australian health survey, where &lt;strong&gt;visits&lt;/strong&gt; is the number of doctor visits in the past two weeks, together with the 11 features listed above.&lt;/p&gt;
&lt;p&gt;First we display the summary of the data to inspect any unwanted issues.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(doc)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;     visits          gender          age             income      
 Min.   :0.0000   male  :2488   Min.   :0.1900   Min.   :0.0000  
 1st Qu.:0.0000   female:2702   1st Qu.:0.2200   1st Qu.:0.2500  
 Median :0.0000                 Median :0.3200   Median :0.5500  
 Mean   :0.3017                 Mean   :0.4064   Mean   :0.5832  
 3rd Qu.:0.0000                 3rd Qu.:0.6200   3rd Qu.:0.9000  
 Max.   :9.0000                 Max.   :0.7200   Max.   :1.5000  
    illness         reduced            health       private    freepoor  
 Min.   :0.000   Min.   : 0.0000   Min.   : 0.000   no :2892   no :4968  
 1st Qu.:0.000   1st Qu.: 0.0000   1st Qu.: 0.000   yes:2298   yes: 222  
 Median :1.000   Median : 0.0000   Median : 0.000                        
 Mean   :1.432   Mean   : 0.8619   Mean   : 1.218                        
 3rd Qu.:2.000   3rd Qu.: 0.0000   3rd Qu.: 2.000                        
 Max.   :5.000   Max.   :14.0000   Max.   :12.000                        
 freerepat  nchronic   lchronic  
 no :4099   no :3098   no :4585  
 yes:1091   yes:2092   yes: 605  
                                 
                                 
                                 
                                 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see we do not have missing values, and the visits values range from 0 to 9 but should be of integer type rather than double. Similarly, the variable &lt;strong&gt;illness&lt;/strong&gt; should be converted to factor type since it has only a few distinct values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;doc$visits&amp;lt;-as.integer(doc$visits)
doc$illness &amp;lt;- as.factor(doc$illness)
tab &amp;lt;- table(doc$visits)
tab&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
   0    1    2    3    4    5    6    7    8    9 
4141  782  174   30   24    9   12   12    5    1 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The best way to start analyzing the data is to display the &lt;strong&gt;correlation coefficient&lt;/strong&gt; for each pair of variables. Any predictor that has a high correlation with the target variable is likely to be relevant in our future model. Notice that our target variable is not continuous, hence we will use the 
&lt;a href=&#34;https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient&#34;&gt;spearman correlation&lt;/a&gt;. As required by the &lt;strong&gt;correlate&lt;/strong&gt; function from the &lt;strong&gt;corrr&lt;/strong&gt; package, all the variables must be of numeric type, so we convert all the factors to integers.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;doc1 &amp;lt;-modify_if(doc, is.factor, as.integer)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that we have stored the result in a new object, &lt;strong&gt;doc1&lt;/strong&gt;, to keep our original data intact.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;M &amp;lt;- correlate(doc1, method=&amp;quot;spearman&amp;quot;)
rplot(shave(M), colours=c(&amp;quot;red&amp;quot;, &amp;quot;white&amp;quot;, &amp;quot;blue&amp;quot; ))+
   theme(axis.text.x = element_text(angle = 90, hjust = 1))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/count_data/2020-01-06-count-data-models_files/figure-html/unnamed-chunk-6-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Looking at this plot, all the correlations have low values. However, these correlations assess only monotonic relations; they say nothing about any other form of relationship.&lt;br /&gt;
First let’s compare the empirical distribution of the variable &lt;strong&gt;visits&lt;/strong&gt; with the theoretical poisson distribution with &lt;span class=&#34;math inline&#34;&gt;\(\lambda\)&lt;/span&gt; equal to the visits mean 0.3017341, given the total number of observations 5190.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pos &amp;lt;- dpois(0:9,0.302)*5190
both &amp;lt;- numeric(20)
both[1:20 %% 2 != 0] &amp;lt;- tab
both[1:20 %% 2 == 0] &amp;lt;- pos
labels&amp;lt;-character(20)
labels[1:20 %% 2==0]&amp;lt;-as.character(0:9)
barplot(both,col=rep(c(&amp;quot;red&amp;quot;,&amp;quot;yellow&amp;quot;),10),names=labels)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/count_data/2020-01-06-count-data-models_files/figure-html/unnamed-chunk-7-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As we see, the two distributions are fairly close to each other.
Let’s now check the negative binomial distribution, first estimating the 
&lt;a href=&#34;https://influentialpoints.com/Training/negative_binomial_distribution-principles-properties-assumptions.htm&#34;&gt;clumping parameter&lt;/a&gt; &lt;span class=&#34;math inline&#34;&gt;\(k=\frac{\bar x^2}{s^2-\bar x}\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;k&amp;lt;-mean(doc$visits)^2/(var(doc$visits)-mean(doc$visits))
bin&amp;lt;-dnbinom(0:9,0.27,mu=0.302)*5190
both1&amp;lt;-numeric(20)
both1[1:20 %% 2 != 0]&amp;lt;-tab
both1[1:20 %% 2 == 0]&amp;lt;-bin
labels&amp;lt;-character(20)
labels[1:20 %% 2==0]&amp;lt;-as.character(0:9)
barplot(both1,col=rep(c(&amp;quot;red&amp;quot;,&amp;quot;yellow&amp;quot;),10),names=labels)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/count_data/2020-01-06-count-data-models_files/figure-html/unnamed-chunk-8-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;With this comparison it seems that the empirical distribution is closer to the negative binomial than to the poisson distribution.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This data has a very large number of zeros in the outcome compared to the other values, which means that any trained model that does not take this anomaly into account will be biased towards predicting the &lt;strong&gt;zero&lt;/strong&gt; value. At the end of this article I will show two well-known models for handling this type of count data, the &lt;strong&gt;Hurdle&lt;/strong&gt; model and the &lt;strong&gt;zero-inflated&lt;/strong&gt; model.&lt;/p&gt;
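&lt;p&gt;As a preview of those two models, here is a minimal sketch of the general call pattern, assuming the &lt;strong&gt;pscl&lt;/strong&gt; package implementation (the detailed fits come later in the article):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(pscl)
# hurdle: a binary part for zero vs positive, then a truncated count part
mod_hurdle &amp;lt;- hurdle(visits ~ ., data = doc, dist = &amp;quot;poisson&amp;quot;)
# zero-inflated: a mixture of a point mass at zero and a count distribution
mod_zinb &amp;lt;- zeroinfl(visits ~ ., data = doc, dist = &amp;quot;negbin&amp;quot;)&lt;/code&gt;&lt;/pre&gt;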
&lt;/div&gt;
&lt;div id=&#34;data-partition&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Data partition&lt;/h1&gt;
&lt;p&gt;In order to evaluate our model we hold out 20% of the data as a testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
index&amp;lt;-sample(2,nrow(doc),replace = TRUE,p=c(.8,.2))
train&amp;lt;-doc[index==1,]
test&amp;lt;-doc[index==2,]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;poisson-model&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Poisson model&lt;/h1&gt;
&lt;p&gt;This model belongs to the generalized linear model family, so in the function &lt;strong&gt;glm&lt;/strong&gt; we set the argument &lt;strong&gt;family&lt;/strong&gt; to poisson. In practice this model is sufficient for a wide range of count data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
model1&amp;lt;-glm(visits~., data=train, family =&amp;quot;poisson&amp;quot;)
tidy(model1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 16 x 5
   term         estimate std.error statistic   p.value
   &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;
 1 (Intercept)   -2.70     0.141     -19.2   9.14e- 82
 2 genderfemale   0.193    0.0620      3.11  1.88e-  3
 3 age            0.436    0.184       2.37  1.77e-  2
 4 income        -0.161    0.0928     -1.74  8.23e-  2
 5 illness1       0.944    0.113       8.35  6.76e- 17
 6 illness2       1.21     0.118      10.3   1.15e- 24
 7 illness3       1.11     0.132       8.43  3.51e- 17
 8 illness4       1.28     0.140       9.13  7.14e- 20
 9 illness5       1.44     0.139      10.4   2.34e- 25
10 reduced        0.126    0.00560    22.6   6.85e-113
11 health         0.0348   0.0112      3.10  1.91e-  3
12 privateyes     0.111    0.0795      1.39  1.64e-  1
13 freepooryes   -0.344    0.190      -1.81  7.00e-  2
14 freerepatyes   0.0377   0.104       0.363 7.16e-  1
15 nchronicyes    0.0186   0.0732      0.254 7.99e-  1
16 lchronicyes    0.0255   0.0916      0.279 7.81e-  1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Judging by the p-values, &lt;strong&gt;income&lt;/strong&gt; is not significant at the 5% level, so we remove this variable and re-estimate the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
model1&amp;lt;-glm(visits~.-income, data=train, family =&amp;quot;poisson&amp;quot;)
tidy(model1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 15 x 5
   term         estimate std.error statistic   p.value
   &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;
 1 (Intercept)   -2.83     0.121     -23.4   7.36e-121
 2 genderfemale   0.213    0.0609      3.51  4.56e-  4
 3 age            0.479    0.183       2.62  8.91e-  3
 4 illness1       0.946    0.113       8.38  5.44e- 17
 5 illness2       1.21     0.118      10.3   8.29e- 25
 6 illness3       1.12     0.132       8.50  1.93e- 17
 7 illness4       1.28     0.140       9.17  4.71e- 20
 8 illness5       1.45     0.139      10.5   1.05e- 25
 9 reduced        0.126    0.00560    22.6   1.12e-112
10 health         0.0350   0.0112      3.11  1.84e-  3
11 privateyes     0.100    0.0793      1.27  2.06e-  1
12 freepooryes   -0.290    0.188      -1.55  1.22e-  1
13 freerepatyes   0.0683   0.102       0.667 5.05e-  1
14 nchronicyes    0.0171   0.0731      0.235 8.15e-  1
15 lchronicyes    0.0282   0.0914      0.308 7.58e-  1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To interpret the coefficient estimates, we should exponentiate them to obtain multiplicative effects, since the poisson model uses the log link function to preclude negative fitted values. For a continuous predictor, say age, if the predictor increases by one unit, ceteris paribus, we expect the number of doctor visits to be &lt;span class=&#34;math inline&#34;&gt;\(exp(0.47876624)=1.614082\)&lt;/span&gt; times larger. For a categorical predictor, say gender, females have &lt;span class=&#34;math inline&#34;&gt;\(exp(0.21342446)=1.23791\)&lt;/span&gt; times as many doctor visits as males.&lt;br /&gt;
Judging by the p-values, most of the predictors are significant; however, we still have to check other statistics and metrics.&lt;/p&gt;
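&lt;p&gt;The exponentiated coefficients can be obtained in one step; a quick sketch:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# multiplicative effects on the expected number of visits
round(exp(coef(model1)), 3)&lt;/code&gt;&lt;/pre&gt;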
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;glance(model1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 1 x 8
  null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
          &amp;lt;dbl&amp;gt;   &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;       &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;
1         4565.    4154 -2685. 5399. 5494.    3486.        4140  4155&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Comparing the deviance value &lt;strong&gt;3485.905&lt;/strong&gt; with the degrees of freedom &lt;strong&gt;4140&lt;/strong&gt; gives a first hint about a possible &lt;strong&gt;overdispersion&lt;/strong&gt; problem.
Fortunately, the &lt;strong&gt;AER&lt;/strong&gt; package provides a very easy way to test the significance of this dispersion via the function &lt;strong&gt;dispersiontest&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dispersiontest(model1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
    Overdispersion test

data:  model1
z = 6.278, p-value = 1.714e-10
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion 
  1.397176 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If our target variable really follows a poisson distribution, then its variance &lt;span class=&#34;math inline&#34;&gt;\(V\)&lt;/span&gt; should be approximately equal to its mean &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt;, which is the null hypothesis of the &lt;strong&gt;dispersiontest&lt;/strong&gt; test against the alternative hypothesis that the variance is of the form:
&lt;span class=&#34;math display&#34;&gt;\[V=\mu+\alpha.trafo(\mu)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Where &lt;strong&gt;trafo&lt;/strong&gt; is a hyperparameter that should be specified as an argument of this test. The popular choices for this argument are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;trafo = NULL (default): &lt;span class=&#34;math inline&#34;&gt;\(V=(1+\alpha)\mu\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;trafo = 1: &lt;span class=&#34;math inline&#34;&gt;\(V=\mu+\alpha.\mu\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;trafo = 2: &lt;span class=&#34;math inline&#34;&gt;\(V=\mu+\alpha.\mu^2\)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the first form holds, the data will be better modeled by a quasi-poisson model than by a poisson model.
If one of the last two holds, the negative binomial will be better than the poisson model.&lt;br /&gt;
Once the trafo is defined, the test estimates &lt;span class=&#34;math inline&#34;&gt;\(\alpha\)&lt;/span&gt;, such that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;if &lt;span class=&#34;math inline&#34;&gt;\(\alpha = 0\)&lt;/span&gt; : equidispersion (The null hypothesis)&lt;/li&gt;
&lt;li&gt;if &lt;span class=&#34;math inline&#34;&gt;\(\alpha &amp;lt; 0\)&lt;/span&gt; : underdispersion&lt;/li&gt;
&lt;li&gt;if &lt;span class=&#34;math inline&#34;&gt;\(\alpha &amp;gt; 0\)&lt;/span&gt; : overdispersion&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Therefore, the result depends on the direction of the test: &lt;strong&gt;two.sided&lt;/strong&gt;, &lt;strong&gt;greater&lt;/strong&gt; (the default) for overdispersion, or &lt;strong&gt;less&lt;/strong&gt; for underdispersion.&lt;/p&gt;
&lt;p&gt;With this in mind, the above test (with the default values) checked for overdispersion against the quasi-poisson alternative, and since the p-value &lt;strong&gt;1.714e-10&lt;/strong&gt; is very small, we do have an overdispersion problem, suggesting the use of a quasi-poisson model instead.&lt;/p&gt;
&lt;p&gt;Now let’s test the negative binomial alternatives.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dispersiontest(model1, trafo = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
    Overdispersion test

data:  model1
z = 6.278, p-value = 1.714e-10
alternative hypothesis: true alpha is greater than 0
sample estimates:
    alpha 
0.3971763 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The test suggests the use of the negative binomial with a linear variance function, with a very small p-value &lt;strong&gt;1.714e-10&lt;/strong&gt;. This model is known as NB1 (linear variance function).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dispersiontest(model1, trafo = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
    Overdispersion test

data:  model1
z = 7.4723, p-value = 3.939e-14
alternative hypothesis: true alpha is greater than 0
sample estimates:
  alpha 
0.95488 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the relation is quadratic, the model is called NB2. Since this p-value &lt;strong&gt;3.939e-14&lt;/strong&gt; is smaller than the previous one, NB2 may be more appropriate than NB1.&lt;/p&gt;
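&lt;p&gt;The NB2 specification corresponds to the classical negative binomial GLM, which can be fitted with &lt;strong&gt;glm.nb&lt;/strong&gt; from the &lt;strong&gt;MASS&lt;/strong&gt; package loaded earlier; a minimal sketch (the negative binomial model is explored in its own section):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_nb &amp;lt;- glm.nb(visits ~ . - income, data = train)
summary(model_nb)&lt;/code&gt;&lt;/pre&gt;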
&lt;/div&gt;
&lt;div id=&#34;quasi-poisson-model&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Quasi poisson model&lt;/h1&gt;
&lt;p&gt;The first test suggested the use of quasi-poisson model, so let’s train this model with the same predictors as the previous one.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
model2&amp;lt;-glm(visits~.-income, data=train, family =&amp;quot;quasipoisson&amp;quot;)
tidy(model2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 15 x 5
   term         estimate std.error statistic  p.value
   &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
 1 (Intercept)   -2.83     0.140     -20.2   1.25e-86
 2 genderfemale   0.213    0.0705      3.03  2.47e- 3
 3 age            0.479    0.212       2.26  2.39e- 2
 4 illness1       0.946    0.131       7.24  5.36e-13
 5 illness2       1.21     0.136       8.89  9.15e-19
 6 illness3       1.12     0.152       7.34  2.49e-13
 7 illness4       1.28     0.162       7.92  2.91e-15
 8 illness5       1.45     0.160       9.06  2.01e-19
 9 reduced        0.126    0.00647    19.5   4.79e-81
10 health         0.0350   0.0130      2.69  7.14e- 3
11 privateyes     0.100    0.0917      1.09  2.74e- 1
12 freepooryes   -0.290    0.217      -1.34  1.81e- 1
13 freerepatyes   0.0683   0.119       0.576 5.64e- 1
14 nchronicyes    0.0171   0.0846      0.203 8.39e- 1
15 lchronicyes    0.0282   0.106       0.266 7.90e- 1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This model uses quasi maximum likelihood, which gives the same coefficient estimates but with different (corrected) standard errors.&lt;br /&gt;
The two models are therefore identical except for the corrected standard errors, which are now larger. In other words, under overdispersion the poisson model underestimates the standard errors, so the &lt;strong&gt;t test&lt;/strong&gt; is biased towards rejecting the null hypothesis.
To better understand what is going on with the quasi-poisson model, let’s put the estimates and standard errors of both models into one table, adding a column obtained by dividing the second vector of standard errors by the first one.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;D1 &amp;lt;- tidy(model1)
colnames(D1) &amp;lt;- NULL
D2 &amp;lt;- tidy(model2)
colnames(D2) &amp;lt;- NULL
tibble(term=D1[[1]], estimate1=D1[[2]], std1=D1[[3]],estimate2=D2[[2]], std2=D2[[3]], dispersion= std2/std1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# A tibble: 15 x 6
   term         estimate1    std1 estimate2    std2 dispersion
   &amp;lt;chr&amp;gt;            &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;      &amp;lt;dbl&amp;gt;
 1 (Intercept)    -2.83   0.121     -2.83   0.140         1.16
 2 genderfemale    0.213  0.0609     0.213  0.0705        1.16
 3 age             0.479  0.183      0.479  0.212         1.16
 4 illness1        0.946  0.113      0.946  0.131         1.16
 5 illness2        1.21   0.118      1.21   0.136         1.16
 6 illness3        1.12   0.132      1.12   0.152         1.16
 7 illness4        1.28   0.140      1.28   0.162         1.16
 8 illness5        1.45   0.139      1.45   0.160         1.16
 9 reduced         0.126  0.00560    0.126  0.00647       1.16
10 health          0.0350 0.0112     0.0350 0.0130        1.16
11 privateyes      0.100  0.0793     0.100  0.0917        1.16
12 freepooryes    -0.290  0.188     -0.290  0.217         1.16
13 freerepatyes    0.0683 0.102      0.0683 0.119         1.16
14 nchronicyes     0.0171 0.0731     0.0171 0.0846        1.16
15 lchronicyes     0.0282 0.0914     0.0282 0.106         1.16&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The columns &lt;strong&gt;estimate1&lt;/strong&gt; and &lt;strong&gt;std1&lt;/strong&gt; belong to model1, and &lt;strong&gt;estimate2&lt;/strong&gt; and &lt;strong&gt;std2&lt;/strong&gt; to model2.
Not surprisingly, the last column is constant, since this is exactly what quasi maximum likelihood does: it computes the corrected standard errors from the original ones as &lt;span class=&#34;math inline&#34;&gt;\(std2=dispersion \times std1\)&lt;/span&gt;, with the dispersion estimated here as &lt;strong&gt;1.15718&lt;/strong&gt;. Where does this value come from? It is simply the standard deviation of the Pearson (standardized) residuals of the original model. We can recover it by setting the &lt;code&gt;type&lt;/code&gt; argument to &lt;strong&gt;pear&lt;/strong&gt; and computing sigma by hand as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;resid &amp;lt;- resid(model1, type = &amp;quot;pear&amp;quot;) # Pearson residuals
sqrt(sum(resid^2)/4140) # 4140 = residual degrees of freedom&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 1.15718&lt;/code&gt;&lt;/pre&gt;
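&lt;p&gt;The same mechanics can be checked end to end on simulated data (a self-contained sketch; the data and names here are made up): the quasi-poisson standard errors are exactly the poisson ones multiplied by the square root of the Pearson dispersion.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(42)
x &amp;lt;- rnorm(500)
y &amp;lt;- rnbinom(500, mu = exp(0.3 + 0.5 * x), size = 1) # overdispersed counts
mp  &amp;lt;- glm(y ~ x, family = poisson)
mqp &amp;lt;- glm(y ~ x, family = quasipoisson)
phi &amp;lt;- sum(residuals(mp, type = &amp;quot;pearson&amp;quot;)^2) / mp$df.residual
ratio &amp;lt;- summary(mqp)$coefficients[, 2] / summary(mp)$coefficients[, 2]
all.equal(unname(ratio), rep(sqrt(phi), 2)) # TRUE: both SEs scaled by sqrt(phi)&lt;/code&gt;&lt;/pre&gt;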
&lt;p&gt;Now, to assess the predictive quality of our models, we use the testing set &lt;strong&gt;test&lt;/strong&gt; and plot the original against the predicted values.
Let’s start with model1.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;- predict.glm(model1,newdata=test[test$visits!=0,],type = &amp;quot;response&amp;quot;)
plot(test$visits[test$visits!=0],type = &amp;quot;b&amp;quot;,col=&amp;quot;red&amp;quot;)
lines(round(pred),col=&amp;quot;blue&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/count_data/2020-01-06-count-data-models_files/figure-html/unnamed-chunk-19-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As you may have noticed, because of the large number of zeros in the target variable, I have intentionally removed all these values to get a clearer plot.
From this plot we can say that the model does not fit the data well; in particular, the larger values are not well captured. However, this may be due to the data being heavily skewed towards zero.&lt;/p&gt;
&lt;p&gt;To compare the different models we use the &lt;strong&gt;root mean square error&lt;/strong&gt; and the &lt;strong&gt;mean absolute error&lt;/strong&gt; (computed on all the data, zeros included).
&lt;strong&gt;Note&lt;/strong&gt;: Here we use the &lt;strong&gt;rmse&lt;/strong&gt; function from &lt;strong&gt;ModelMetrics&lt;/strong&gt;, which expects two vectors as input, not the function of the same name from the &lt;strong&gt;performance&lt;/strong&gt; package, which expects a model object. To avoid any ambiguity, call it as &lt;code&gt;ModelMetrics::rmse&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict.glm(model1, newdata = test, type = &amp;quot;response&amp;quot;)
rmsemodelp &amp;lt;- ModelMetrics::rmse(test$visits,round(pred))
maemodelp &amp;lt;- mae(test$visits,round(pred))
rmsemodelp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 0.7381921&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;maemodelp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 0.284058&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the same way, let’s now evaluate the quasi-poisson model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predq&amp;lt;- predict.glm(model2,newdata=test[test$visits!=0,],type = &amp;quot;response&amp;quot;)
plot(test$visits[test$visits!=0],type = &amp;quot;b&amp;quot;,col=&amp;quot;red&amp;quot;)
lines(round(predq),col=&amp;quot;blue&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/count_data/2020-01-06-count-data-models_files/figure-html/unnamed-chunk-21-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This plot does not look very different from the previous one.
The rmse and mae for this model are computed as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predq &amp;lt;- predict.glm(model2,newdata=test, type = &amp;quot;response&amp;quot;)
rmsemodelqp &amp;lt;- ModelMetrics::rmse(test$visits,round(predq))
maemodelqp &amp;lt;- mae(test$visits,round(predq))
rmsemodelqp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 0.7381921&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;maemodelqp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;[1] 0.284058&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will hold off on comparing these two models until all the remaining models are trained, and then compare them all at once.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;negative-binomial-model&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Negative binomial model&lt;/h1&gt;
&lt;p&gt;The negative binomial distribution is used as an alternative to the poisson distribution when overdispersion is present. We fit it with the &lt;strong&gt;glm.nb&lt;/strong&gt; function from the &lt;strong&gt;MASS&lt;/strong&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
model3&amp;lt;-glm.nb(visits~.-income, data=train)
summary(model3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
Call:
glm.nb(formula = visits ~ . - income, data = train, init.theta = 0.9715923611, 
    link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8413  -0.6894  -0.5335  -0.3540   3.6726  

Coefficients:
              Estimate Std. Error z value Pr(&amp;gt;|z|)    
(Intercept)  -2.894820   0.135760 -21.323  &amp;lt; 2e-16 ***
genderfemale  0.258968   0.075352   3.437 0.000589 ***
age           0.511867   0.230297   2.223 0.026240 *  
illness1      0.880644   0.123264   7.144 9.04e-13 ***
illness2      1.171615   0.130240   8.996  &amp;lt; 2e-16 ***
illness3      1.118067   0.149032   7.502 6.28e-14 ***
illness4      1.263367   0.165370   7.640 2.18e-14 ***
illness5      1.378166   0.169907   8.111 5.01e-16 ***
reduced       0.141389   0.008184  17.275  &amp;lt; 2e-16 ***
health        0.041364   0.015029   2.752 0.005918 ** 
privateyes    0.086188   0.095173   0.906 0.365149    
freepooryes  -0.375471   0.223857  -1.677 0.093487 .  
freerepatyes  0.144928   0.127751   1.134 0.256602    
nchronicyes   0.022111   0.087590   0.252 0.800705    
lchronicyes   0.091622   0.114965   0.797 0.425477    
---
Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1

(Dispersion parameter for Negative Binomial(0.9716) family taken to be 1)

    Null deviance: 3208.0  on 4154  degrees of freedom
Residual deviance: 2431.2  on 4140  degrees of freedom
AIC: 5159.6

Number of Fisher Scoring iterations: 1

              Theta:  0.972 
          Std. Err.:  0.103 

 2 x log-likelihood:  -5127.587 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As before we visualize the performance of this model as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prednb&amp;lt;- predict.glm(model3,newdata=test[test$visits!=0,],type = &amp;quot;response&amp;quot;)
plot(test$visits[test$visits!=0],type = &amp;quot;b&amp;quot;,col=&amp;quot;red&amp;quot;)
lines(round(prednb),col=&amp;quot;blue&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/count_data/2020-01-06-count-data-models_files/figure-html/unnamed-chunk-24-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Again, this plot looks much the same as the previous ones, so to figure out which model is best we rely on the numeric metrics.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prednb&amp;lt;- predict.glm(model3,newdata=test,type = &amp;quot;response&amp;quot;)
rmsemodelnb&amp;lt;-ModelMetrics::rmse(test$visits,round(prednb))
maemodelnb&amp;lt;-mae(test$visits,round(prednb))
knitr::kable(tibble(rms=rmsemodelnb,mae=maemodelnb))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;rms&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mae&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;0.7808085&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2966184&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We will use these outputs later in the model comparison.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;hurdle-model&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Hurdle model&lt;/h1&gt;
&lt;p&gt;Originally proposed by Mullahy (1986), this model accounts for the excess of zeros in the data and can also handle the overdispersion problem. It has two components (or steps): a truncated count component defined by a chosen discrete distribution such as the poisson or the negative binomial, and a hurdle component that models zero versus larger counts (using a censored count distribution or a binomial model). In other words, this model assumes that two population distributions underlie the data: one for the zero values and a different one for the positive values. For more details about hurdle and zero-inflated models click &lt;a href=&#34;https://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf#cite.countreg%3AZeileis%3A2006&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To fit this model we use the &lt;strong&gt;hurdle&lt;/strong&gt; function from the &lt;strong&gt;pscl&lt;/strong&gt; package.&lt;/p&gt;
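&lt;p&gt;The two-component structure can be illustrated on simulated data (a self-contained sketch with made-up names; the zero-truncated draws use inverse-CDF sampling):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(pscl)
set.seed(3)
n &amp;lt;- 500
x &amp;lt;- rnorm(n)
p_pos &amp;lt;- plogis(-0.5 + x)                   # hurdle step: P(y &amp;gt; 0)
lam &amp;lt;- exp(0.3)                             # mean of the truncated count step
u &amp;lt;- runif(n, min = dpois(0, lam), max = 1)
y_pos &amp;lt;- qpois(u, lam)                      # zero-truncated poisson draws (all &amp;gt;= 1)
y &amp;lt;- ifelse(rbinom(n, 1, p_pos) == 1, y_pos, 0)
h &amp;lt;- hurdle(y ~ x, dist = &amp;quot;poisson&amp;quot;)
coef(h, model = &amp;quot;zero&amp;quot;)  # binary hurdle component (logit)
coef(h, model = &amp;quot;count&amp;quot;) # truncated poisson component&lt;/code&gt;&lt;/pre&gt;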
&lt;div id=&#34;hurdle-model-with-poisson-distribution.&#34; class=&#34;section level2&#34; number=&#34;7.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;7.1&lt;/span&gt; hurdle model with poisson distribution.&lt;/h2&gt;
&lt;p&gt;This model works in two steps. In the first step, a binary classifier discriminates between zero and positive values; in the second step, a traditional count model (poisson or negative binomial; here we use the poisson) is fitted to the positive values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(pscl)
set.seed(123)
modelhp&amp;lt;-hurdle(visits~. -income, data=train,dist = &amp;quot;poisson&amp;quot;)
summary(modelhp)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
Call:
hurdle(formula = visits ~ . - income, data = train, dist = &amp;quot;poisson&amp;quot;)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.5464 -0.4686 -0.3306 -0.2075 11.0887 

Count model coefficients (truncated poisson with log link):
              Estimate Std. Error z value Pr(&amp;gt;|z|)    
(Intercept)  -0.977535   0.261835  -3.733 0.000189 ***
genderfemale  0.073326   0.098034   0.748 0.454480    
age           0.032762   0.287991   0.114 0.909427    
illness1      0.370071   0.251920   1.469 0.141833    
illness2      0.403514   0.256363   1.574 0.115489    
illness3      0.201724   0.278757   0.724 0.469277    
illness4      0.420285   0.277573   1.514 0.129990    
illness5      0.762209   0.269809   2.825 0.004728 ** 
reduced       0.111640   0.007967  14.013  &amp;lt; 2e-16 ***
health        0.007682   0.016452   0.467 0.640554    
privateyes   -0.215649   0.129860  -1.661 0.096787 .  
freepooryes   0.066277   0.269699   0.246 0.805879    
freerepatyes -0.434941   0.166196  -2.617 0.008870 ** 
nchronicyes   0.109660   0.125380   0.875 0.381779    
lchronicyes   0.135612   0.142766   0.950 0.342166    
Zero hurdle model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(&amp;gt;|z|)    
(Intercept)  -3.224621   0.156469 -20.609  &amp;lt; 2e-16 ***
genderfemale  0.305324   0.089648   3.406 0.000660 ***
age           0.700345   0.276089   2.537 0.011191 *  
illness1      0.885148   0.136951   6.463 1.02e-10 ***
illness2      1.238227   0.146059   8.478  &amp;lt; 2e-16 ***
illness3      1.263698   0.169344   7.462 8.50e-14 ***
illness4      1.405167   0.195388   7.192 6.40e-13 ***
illness5      1.445585   0.208425   6.936 4.04e-12 ***
reduced       0.154858   0.013488  11.481  &amp;lt; 2e-16 ***
health        0.070464   0.019142   3.681 0.000232 ***
privateyes    0.271192   0.112751   2.405 0.016163 *  
freepooryes  -0.546177   0.277942  -1.965 0.049406 *  
freerepatyes  0.423220   0.153994   2.748 0.005991 ** 
nchronicyes  -0.006256   0.102033  -0.061 0.951106    
lchronicyes   0.070658   0.140587   0.503 0.615251    
---
Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1 

Number of iterations in BFGS optimization: 22 
Log-likelihood: -2581 on 30 Df&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, this output has two tables. The upper one is for the poisson model fitted only to the truncated positive values, and the lower one is the result of the logistic regression with two classes (zero or positive).&lt;br /&gt;
As we did before, we plot the results.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predhp&amp;lt;- predict(modelhp,newdata=test[test$visits!=0,],type = &amp;quot;response&amp;quot;)
plot(test$visits[test$visits!=0],type = &amp;quot;b&amp;quot;,col=&amp;quot;red&amp;quot;)
lines(round(predhp),col=&amp;quot;blue&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/count_data/2020-01-06-count-data-models_files/figure-html/unnamed-chunk-27-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As before, looking at the plot alone does not let us decide which model is best, so we turn to the numeric metrics.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predhp&amp;lt;- predict(modelhp,newdata=test, type = &amp;quot;response&amp;quot;)
rmsemodelhp&amp;lt;-ModelMetrics::rmse(test$visits,round(predhp))
maemodelhp&amp;lt;-mae(test$visits,round(predhp))
knitr::kable(tibble(rmse=rmsemodelhp,mae=
maemodelhp))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;rmse&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mae&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;0.7375374&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2850242&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;hurdle-model-with-negative-binomial-distribution.&#34; class=&#34;section level2&#34; number=&#34;7.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;7.2&lt;/span&gt; hurdle model with negative binomial distribution.&lt;/h2&gt;
&lt;p&gt;Now let’s use the negative binomial instead of the poisson distribution.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modelhnb&amp;lt;-hurdle(visits~.-income, data=train,dist = &amp;quot;negbin&amp;quot;)
summary(modelhnb)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
Call:
hurdle(formula = visits ~ . - income, data = train, dist = &amp;quot;negbin&amp;quot;)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.9078 -0.4515 -0.3201 -0.2022 10.6552 

Count model coefficients (truncated negbin with log link):
             Estimate Std. Error z value Pr(&amp;gt;|z|)    
(Intercept)  -3.68462    2.65037  -1.390   0.1645    
genderfemale  0.07432    0.17299   0.430   0.6675    
age           0.26774    0.54001   0.496   0.6200    
illness1      0.27678    0.34564   0.801   0.4233    
illness2      0.37093    0.35241   1.053   0.2925    
illness3      0.04728    0.39747   0.119   0.9053    
illness4      0.40386    0.40517   0.997   0.3189    
illness5      0.68213    0.41357   1.649   0.0991 .  
reduced       0.15813    0.01935   8.171 3.05e-16 ***
health        0.01891    0.03291   0.575   0.5656    
privateyes   -0.45711    0.23118  -1.977   0.0480 *  
freepooryes   0.03334    0.55282   0.060   0.9519    
freerepatyes -0.59189    0.30437  -1.945   0.0518 .  
nchronicyes   0.08737    0.21061   0.415   0.6783    
lchronicyes   0.30274    0.25846   1.171   0.2415    
Log(theta)   -2.80552    2.80120  -1.002   0.3166    
Zero hurdle model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(&amp;gt;|z|)    
(Intercept)  -3.224621   0.156469 -20.609  &amp;lt; 2e-16 ***
genderfemale  0.305324   0.089648   3.406 0.000660 ***
age           0.700345   0.276089   2.537 0.011191 *  
illness1      0.885148   0.136951   6.463 1.02e-10 ***
illness2      1.238227   0.146059   8.478  &amp;lt; 2e-16 ***
illness3      1.263698   0.169344   7.462 8.50e-14 ***
illness4      1.405167   0.195388   7.192 6.40e-13 ***
illness5      1.445585   0.208425   6.936 4.04e-12 ***
reduced       0.154858   0.013488  11.481  &amp;lt; 2e-16 ***
health        0.070464   0.019142   3.681 0.000232 ***
privateyes    0.271192   0.112751   2.405 0.016163 *  
freepooryes  -0.546177   0.277942  -1.965 0.049406 *  
freerepatyes  0.423220   0.153994   2.748 0.005991 ** 
nchronicyes  -0.006256   0.102033  -0.061 0.951106    
lchronicyes   0.070658   0.140587   0.503 0.615251    
---
Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1 

Theta: count = 0.0605
Number of iterations in BFGS optimization: 31 
Log-likelihood: -2524 on 31 Df&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And let’s plot the predicted against the actual values of the testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predhnb&amp;lt;- predict(modelhnb,newdata=test[test$visits!=0,],type = &amp;quot;response&amp;quot;)
plot(test$visits[test$visits!=0],type = &amp;quot;b&amp;quot;,col=&amp;quot;red&amp;quot;)
lines(round(predhnb),col=&amp;quot;blue&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/count_data/2020-01-06-count-data-models_files/figure-html/unnamed-chunk-30-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;And for the metrics.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predhnb&amp;lt;- predict(modelhnb,newdata=test,type = &amp;quot;response&amp;quot;)
rmsemodelhnb&amp;lt;-ModelMetrics::rmse(test$visits,round(predhnb))
maemodelhnb&amp;lt;-mae(test$visits,round(predhnb))
knitr::kable(tibble(rmse=rmsemodelhnb,mae=
maemodelhnb))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;rmse&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mae&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;0.7408052&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2879227&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;zero-inflated-model&#34; class=&#34;section level1&#34; number=&#34;8&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;8&lt;/span&gt; Zero inflated model&lt;/h1&gt;
&lt;p&gt;Like the previous model type, this model also combines two components, but with the difference that it is a mixture: a binomial component generates the extra (structural) zeros, while a poisson (or negative binomial) component generates the remaining values (zeros included).&lt;/p&gt;
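&lt;p&gt;The mixture structure shows up directly in the predictions: the response mean is the count-component mean scaled by the probability of not being a structural zero. A self-contained sketch on simulated data (names are made up for illustration):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(pscl)
set.seed(1)
n &amp;lt;- 500
x &amp;lt;- rnorm(n)
y &amp;lt;- ifelse(runif(n) &amp;lt; 0.3, 0, rpois(n, exp(0.2 + 0.4 * x))) # 30% structural zeros
zim &amp;lt;- zeroinfl(y ~ x | 1, dist = &amp;quot;poisson&amp;quot;)
mu &amp;lt;- predict(zim, type = &amp;quot;count&amp;quot;) # poisson mean of the count component
p0 &amp;lt;- predict(zim, type = &amp;quot;zero&amp;quot;)  # probability of a structural zero
all.equal(predict(zim, type = &amp;quot;response&amp;quot;), (1 - p0) * mu) # TRUE&lt;/code&gt;&lt;/pre&gt;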
&lt;div id=&#34;zero-inflated-model-with-poisson-distribution&#34; class=&#34;section level2&#34; number=&#34;8.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;8.1&lt;/span&gt; Zero inflated model with poisson distribution&lt;/h2&gt;
&lt;p&gt;Here also we fit two models, one with the poisson and one with the negative binomial distribution.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modelzp&amp;lt;-zeroinfl(visits~.-income, data=train,dist = &amp;quot;poisson&amp;quot;)
summary(modelzp)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
Call:
zeroinfl(formula = visits ~ . - income, data = train, dist = &amp;quot;poisson&amp;quot;)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.6247 -0.4791 -0.3326 -0.1783 12.3448 

Count model coefficients (poisson with log link):
             Estimate Std. Error z value Pr(&amp;gt;|z|)    
(Intercept)  -0.86467    0.23608  -3.663  0.00025 ***
genderfemale  0.03280    0.09078   0.361  0.71789    
age           0.13331    0.26922   0.495  0.62049    
illness1      0.32986    0.21846   1.510  0.13105    
illness2      0.34800    0.22426   1.552  0.12071    
illness3      0.20400    0.24152   0.845  0.39832    
illness4      0.44020    0.24324   1.810  0.07034 .  
illness5      0.72463    0.23632   3.066  0.00217 ** 
reduced       0.09679    0.00809  11.964  &amp;lt; 2e-16 ***
health        0.02269    0.01609   1.410  0.15860    
privateyes   -0.26390    0.12796  -2.062  0.03918 *  
freepooryes   0.04860    0.27675   0.176  0.86059    
freerepatyes -0.51894    0.17070  -3.040  0.00237 ** 
nchronicyes   0.08577    0.11490   0.746  0.45538    
lchronicyes   0.10876    0.12745   0.853  0.39346    

Zero-inflation model coefficients (binomial with logit link):
             Estimate Std. Error z value Pr(&amp;gt;|z|)    
(Intercept)   2.44812    0.34933   7.008 2.42e-12 ***
genderfemale -0.48766    0.19038  -2.562 0.010422 *  
age          -0.88816    0.59030  -1.505 0.132431    
illness1     -0.80833    0.31248  -2.587 0.009685 ** 
illness2     -1.41461    0.35338  -4.003 6.25e-05 ***
illness3     -1.69204    0.44028  -3.843 0.000121 ***
illness4     -1.52224    0.46334  -3.285 0.001019 ** 
illness5     -1.08742    0.46493  -2.339 0.019342 *  
reduced      -0.14462    0.03861  -3.746 0.000180 ***
health       -0.05796    0.04486  -1.292 0.196386    
privateyes   -0.73945    0.22597  -3.272 0.001066 ** 
freepooryes   0.73371    0.41402   1.772 0.076370 .  
freerepatyes -1.75454    0.53938  -3.253 0.001142 ** 
nchronicyes   0.13229    0.22623   0.585 0.558697    
lchronicyes   0.03647    0.30620   0.119 0.905194    
---
Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1 

Number of iterations in BFGS optimization: 42 
Log-likelihood: -2579 on 30 Df&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predzp&amp;lt;- predict(modelzp,newdata=test[test$visits!=0,],type = &amp;quot;response&amp;quot;)
plot(test$visits[test$visits!=0],type = &amp;quot;b&amp;quot;,col=&amp;quot;red&amp;quot;)
lines(round(predzp),col=&amp;quot;blue&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/count_data/2020-01-06-count-data-models_files/figure-html/unnamed-chunk-33-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predzp&amp;lt;- predict(modelzp,newdata=test,type = &amp;quot;response&amp;quot;)
rmsemodelzp&amp;lt;-ModelMetrics::rmse(test$visits,round(predzp))
maemodelzp&amp;lt;-mae(test$visits,round(predzp))
knitr::kable(tibble(rmse=rmsemodelzp,mae=
maemodelzp))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;rmse&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mae&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;0.7485897&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2898551&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;zero-inflated-model-with-negative-binomial-distribution&#34; class=&#34;section level2&#34; number=&#34;8.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;8.2&lt;/span&gt; Zero inflated model with negative binomial distribution&lt;/h2&gt;
&lt;p&gt;This time let’s try the negative binomial distribution. Note that this formula keeps &lt;strong&gt;income&lt;/strong&gt; among the predictors, unlike the previous models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modelznb&amp;lt;-zeroinfl(visits~., data=train,dist = &amp;quot;negbin&amp;quot;)
summary(modelznb)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
Call:
zeroinfl(formula = visits ~ ., data = train, dist = &amp;quot;negbin&amp;quot;)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.0440 -0.4582 -0.3031 -0.1680 14.2061 

Count model coefficients (negbin with log link):
              Estimate Std. Error z value Pr(&amp;gt;|z|)    
(Intercept)  -1.383100   0.241778  -5.721 1.06e-08 ***
genderfemale  0.038985   0.090534   0.431 0.666751    
age           0.028837   0.290843   0.099 0.921019    
income       -0.196592   0.147560  -1.332 0.182767    
illness1      0.613258   0.174652   3.511 0.000446 ***
illness2      0.692297   0.179663   3.853 0.000117 ***
illness3      0.664613   0.196061   3.390 0.000699 ***
illness4      0.760162   0.204137   3.724 0.000196 ***
illness5      0.944756   0.206097   4.584 4.56e-06 ***
reduced       0.102651   0.008776  11.697  &amp;lt; 2e-16 ***
health        0.044012   0.015536   2.833 0.004611 ** 
privateyes   -0.168864   0.138680  -1.218 0.223358    
freepooryes  -0.422748   0.306653  -1.379 0.168022    
freerepatyes -0.383558   0.163995  -2.339 0.019344 *  
nchronicyes   0.033374   0.107881   0.309 0.757048    
lchronicyes   0.065834   0.128987   0.510 0.609775    
Log(theta)    0.473936   0.142626   3.323 0.000891 ***

Zero-inflation model coefficients (binomial with logit link):
               Estimate Std. Error z value Pr(&amp;gt;|z|)    
(Intercept)   2.280e+00  5.322e-01   4.285 1.83e-05 ***
genderfemale -7.269e-01  2.753e-01  -2.640  0.00828 ** 
age          -2.003e+00  9.202e-01  -2.177  0.02951 *  
income       -1.803e-01  3.933e-01  -0.458  0.64669    
illness1     -3.327e-01  3.480e-01  -0.956  0.33894    
illness2     -1.112e+00  4.496e-01  -2.473  0.01339 *  
illness3     -9.533e-01  5.127e-01  -1.859  0.06297 .  
illness4     -1.551e+00  7.398e-01  -2.097  0.03599 *  
illness5     -1.230e+00  8.597e-01  -1.431  0.15257    
reduced      -1.298e+00  4.577e-01  -2.836  0.00456 ** 
health       -1.443e-03  5.509e-02  -0.026  0.97910    
privateyes   -8.179e-01  3.178e-01  -2.574  0.01005 *  
freepooryes   2.394e-01  6.648e-01   0.360  0.71878    
freerepatyes -1.572e+01  1.528e+03  -0.010  0.99179    
nchronicyes   4.502e-02  2.982e-01   0.151  0.88001    
lchronicyes  -1.637e-01  4.951e-01  -0.331  0.74085    
---
Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1 

Theta = 1.6063 
Number of iterations in BFGS optimization: 66 
Log-likelihood: -2512 on 33 Df&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predznb&amp;lt;- predict(modelznb,newdata=test,type = &amp;quot;response&amp;quot;)
rmsemodelznb&amp;lt;-ModelMetrics::rmse(test$visits,round(predznb))
maemodelznb&amp;lt;-mae(test$visits,round(predznb))
knitr::kable(tibble(rmse=rmsemodelznb,mae=maemodelznb))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;rmse&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mae&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;0.7309579&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2753623&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Finally let’s compare all the above models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rmse&amp;lt;-c(rmsemodelp,rmsemodelqp,rmsemodelnb,rmsemodelhp,rmsemodelhnb,
           rmsemodelzp,rmsemodelznb)
mae&amp;lt;-c(maemodelp,maemodelqp,maemodelnb,maemodelhp,maemodelhnb,
           maemodelzp,maemodelznb)
models&amp;lt;-c(&amp;quot;pois&amp;quot;,&amp;quot;q_pois&amp;quot;,&amp;quot;nb&amp;quot;,&amp;quot;h_pois&amp;quot;,&amp;quot;h_nb&amp;quot;,&amp;quot;zer_pois&amp;quot;,&amp;quot;zer_nb&amp;quot;)

data.frame(models,rmse,mae)%&amp;gt;% 
  arrange(rmse)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;    models      rmse       mae
1   zer_nb 0.7309579 0.2753623
2   h_pois 0.7375374 0.2850242
3     pois 0.7381921 0.2840580
4   q_pois 0.7381921 0.2840580
5     h_nb 0.7408052 0.2879227
6 zer_pois 0.7485897 0.2898551
7       nb 0.7808085 0.2966184&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both metrics pick the zero-inflated negative binomial model as the best one, with the minimum rmse (&lt;strong&gt;0.7309579&lt;/strong&gt;) and the minimum mae (&lt;strong&gt;0.2753623&lt;/strong&gt;). This result is in line with the fact that this kind of model handles the excess of zeros and the overdispersion problem at the same time.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;9&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;9&lt;/span&gt; Conclusion:&lt;/h1&gt;
&lt;p&gt;If the data truly follow a poisson distribution, then all the other models have extra parameters that, during training, converge towards the optimal poisson parameter values; this relation is analogous to that between linear regression and generalized least squares. However, if the data are heavily skewed towards zero, it is better to use the last two model families, which explicitly take care of this issue.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;furhter-reading&#34; class=&#34;section level1&#34; number=&#34;10&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;10&lt;/span&gt; Further reading:&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Michael J. Crawley, The R book, WILEY, UK, 2013. &lt;a href=&#34;http://www.bio.ic.ac.uk/research/mjcraw/therbook/index.htm&#34; class=&#34;uri&#34;&gt;http://www.bio.ic.ac.uk/research/mjcraw/therbook/index.htm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;session-info&#34; class=&#34;section level1&#34; number=&#34;11&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;11&lt;/span&gt; Session info&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pscl_1.5.5           broom_0.7.1          AER_1.2-9           
 [4] survival_3.2-7       sandwich_3.0-0       lmtest_0.9-38       
 [7] zoo_1.8-8            car_3.0-10           carData_3.0-4       
[10] forcats_0.5.0        stringr_1.4.0        dplyr_1.0.2         
[13] readr_1.3.1          tidyr_1.1.2          tibble_3.0.3        
[16] ggplot2_3.3.2        tidyverse_1.3.0      MASS_7.3-53         
[19] purrr_0.3.4          corrr_0.4.2          ModelMetrics_1.2.2.2
[22] performance_0.5.0   

loaded via a namespace (and not attached):
 [1] httr_1.4.2        jsonlite_1.7.1    splines_4.0.1     modelr_0.1.8     
 [5] Formula_1.2-3     assertthat_0.2.1  highr_0.8         blob_1.2.1       
 [9] cellranger_1.1.0  yaml_2.2.1        bayestestR_0.7.2  pillar_1.4.6     
[13] backports_1.1.10  lattice_0.20-41   glue_1.4.2        digest_0.6.25    
[17] rvest_0.3.6       colorspace_1.4-1  htmltools_0.5.0   Matrix_1.2-18    
[21] pkgconfig_2.0.3   haven_2.3.1       bookdown_0.20     scales_1.1.1     
[25] openxlsx_4.2.2    rio_0.5.16        farver_2.0.3      generics_0.0.2   
[29] ellipsis_0.3.1    withr_2.3.0       cli_2.0.2         magrittr_1.5     
[33] crayon_1.3.4      readxl_1.3.1      evaluate_0.14     fs_1.5.0         
[37] fansi_0.4.1       xml2_1.3.2        foreign_0.8-80    blogdown_0.20    
[41] tools_4.0.1       data.table_1.13.0 hms_0.5.3         lifecycle_0.2.0  
[45] munsell_0.5.0     reprex_0.3.0      zip_2.1.1         compiler_4.0.1   
[49] rlang_0.4.7       grid_4.0.1        rstudioapi_0.11   labeling_0.3     
[53] rmarkdown_2.4     gtable_0.3.0      abind_1.4-5       DBI_1.1.0        
[57] curl_4.3          R6_2.4.1          lubridate_1.7.9   knitr_1.30       
[61] utf8_1.1.4        insight_0.9.6     stringi_1.5.3     Rcpp_1.0.5       
[65] vctrs_0.3.4       dbplyr_1.4.4      tidyselect_1.1.0  xfun_0.18        &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Xgboost model</title>
      <link>https://modelingwithr.rbind.io/post/xgboost/xgboost/</link>
      <pubDate>Sun, 05 Jan 2020 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/post/xgboost/xgboost/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-visualization&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Data visualization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-partition&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Data partition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-training&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Model training&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#fine-tune-the-hyperparameters&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Fine tune the hyperparameters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Conclusion:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-information&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8&lt;/span&gt; Session information&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;A decision tree&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; is a model that recursively splits the input space into regions and defines a local model for each resulting region. However, fitting a single decision tree to complex data rarely yields accurate predictions, which is why it is termed a 
&lt;a href=&#34;http://rob.schapire.net/papers/strengthofweak.pdf&#34;&gt;weak learner&lt;/a&gt;. Combining multiple decision trees (also called &lt;strong&gt;ensemble models&lt;/strong&gt;) using techniques such as aggregating and boosting can largely improve the model accuracy. 
&lt;a href=&#34;https://xgboost.readthedocs.io/en/latest/R-package/index.html&#34;&gt;Xgboost&lt;/a&gt; (short for Extreme gradient boosting) is a tree-based algorithm that uses these techniques. It can be used for both &lt;strong&gt;classification&lt;/strong&gt; and &lt;strong&gt;regression&lt;/strong&gt;.
In this paper we learn how to apply this model to the well-known Titanic data, as we did in previous papers with different kinds of models.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;First we load the required packages and the Titanic data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(tidyverse))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;ggplot2&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;tibble&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;tidyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;dplyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(caret))
data &amp;lt;- read_csv(&amp;quot;../train.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Parsed with column specification:
## cols(
##   PassengerId = col_double(),
##   Survived = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Ticket = col_character(),
##   Fare = col_double(),
##   Cabin = col_character(),
##   Embarked = col_character()
## )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s take a look at this data using the &lt;strong&gt;dplyr&lt;/strong&gt; function &lt;strong&gt;glimpse&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;glimpse(data)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 891
## Columns: 12
## $ PassengerId &amp;lt;dbl&amp;gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Survived    &amp;lt;dbl&amp;gt; 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
## $ Pclass      &amp;lt;dbl&amp;gt; 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
## $ Name        &amp;lt;chr&amp;gt; &amp;quot;Braund, Mr. Owen Harris&amp;quot;, &amp;quot;Cumings, Mrs. John Bradley ...
## $ Sex         &amp;lt;chr&amp;gt; &amp;quot;male&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;male&amp;quot;, &amp;quot;male&amp;quot;, &amp;quot;...
## $ Age         &amp;lt;dbl&amp;gt; 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
## $ SibSp       &amp;lt;dbl&amp;gt; 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
## $ Parch       &amp;lt;dbl&amp;gt; 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
## $ Ticket      &amp;lt;chr&amp;gt; &amp;quot;A/5 21171&amp;quot;, &amp;quot;PC 17599&amp;quot;, &amp;quot;STON/O2. 3101282&amp;quot;, &amp;quot;113803&amp;quot;, ...
## $ Fare        &amp;lt;dbl&amp;gt; 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
## $ Cabin       &amp;lt;chr&amp;gt; NA, &amp;quot;C85&amp;quot;, NA, &amp;quot;C123&amp;quot;, NA, NA, &amp;quot;E46&amp;quot;, NA, NA, NA, &amp;quot;G6&amp;quot;,...
## $ Embarked    &amp;lt;chr&amp;gt; &amp;quot;S&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;Q&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;S&amp;quot;, ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For prediction purposes some variables should be removed, such as PassengerId, Name, Ticket, and Cabin, while some others should be converted to a more suitable type. The following script performs these transformations; for more detail you can refer to my previous paper on logistic regression.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-data[,-c(1,4,9,11)]
mydata$Survived&amp;lt;-as.integer(mydata$Survived)
mydata&amp;lt;-modify_at(mydata,c(&amp;quot;Pclass&amp;quot;,&amp;quot;Sex&amp;quot;,&amp;quot;Embarked&amp;quot;,&amp;quot;SibSp&amp;quot;,&amp;quot;Parch&amp;quot;), as.factor)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s check the summary of the transformed data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     Survived      Pclass      Sex           Age        SibSp   Parch  
##  Min.   :0.0000   1:216   female:314   Min.   : 0.42   0:608   0:678  
##  1st Qu.:0.0000   2:184   male  :577   1st Qu.:20.12   1:209   1:118  
##  Median :0.0000   3:491                Median :28.00   2: 28   2: 80  
##  Mean   :0.3838                        Mean   :29.70   3: 16   3:  5  
##  3rd Qu.:1.0000                        3rd Qu.:38.00   4: 18   4:  4  
##  Max.   :1.0000                        Max.   :80.00   5:  5   5:  5  
##                                        NA&amp;#39;s   :177     8:  7   6:  1  
##       Fare        Embarked  
##  Min.   :  0.00   C   :168  
##  1st Qu.:  7.91   Q   : 77  
##  Median : 14.45   S   :644  
##  Mean   : 32.20   NA&amp;#39;s:  2  
##  3rd Qu.: 31.00             
##  Max.   :512.33             
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, we have 177 missing values in the Age variable and 2 in Embarked. For missing values we have two strategies: removing the incomplete rows from the analysis, at the cost of losing a lot of data, or imputing them with one of the available imputation methods. Since we have a large number of missing values compared to the total number of examples, it is better to follow the latter strategy. Thankfully, the 
&lt;a href=&#34;https://cran.r-project.org/web/packages/mice/mice.pdf&#34;&gt;mice&lt;/a&gt; package is very powerful for this purpose and provides many imputation methods for all variable types.
We will opt for the random forest method since in most cases it is the best choice. However, in order to respect the most important rule in machine learning, never touch the test data during the training process, we will apply this imputation after splitting the data.&lt;/p&gt;
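&lt;p&gt;For comparison, a much simpler baseline than mice is single-value imputation; a minimal sketch on a toy vector (not the Titanic data) would be:&lt;/p&gt;

```r
# Toy age vector with missing values, for illustration only
age = c(22, 38, NA, 35, NA, 54)

# Replace each NA by the median of the observed values
age[is.na(age)] = median(age, na.rm = TRUE)
age  # 22.0 38.0 36.5 35.0 36.5 54.0
```

&lt;p&gt;This baseline discards the between-imputation variability that mice preserves, which is one reason the random forest method is preferred here.&lt;/p&gt;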
&lt;/div&gt;
&lt;div id=&#34;data-visualization&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Data visualization&lt;/h1&gt;
&lt;p&gt;Beyond modeling, we have other tools, such as visualization, for investigating relationships between variables. Here we visualize the relationship between each predictor and the target variable using the ggplot2 package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)
ggplot(mydata,aes(Sex,Survived,color=Sex))+
  geom_point()+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/xgboost/xgboost_files/figure-html/unnamed-chunk-6-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The left side of the plot shows that a higher fraction of females survived, whereas the right side shows the reverse for males, most of whom died. We can infer from this plot that, ceteris paribus, this predictor is likely to be relevant for prediction.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(Pclass,Survived,color=Pclass))+
  geom_point()+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/xgboost/xgboost_files/figure-html/unnamed-chunk-7-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In this plot most of the first class passengers survived, in contrast with the third class passengers, most of whom died. The second class, however, seems roughly balanced. Again, this predictor can also be relevant.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(SibSp,Survived,color=SibSp))+
  geom_point()+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/xgboost/xgboost_files/figure-html/unnamed-chunk-8-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This predictor refers to the number of siblings a passenger has. It seems to be equally distributed given the target variable, and hence may be largely irrelevant. In other words, knowing the number of siblings of a particular passenger does not help predict whether this passenger survived or died.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(Parch,Survived,color=Parch))+
  geom_point()+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/xgboost/xgboost_files/figure-html/unnamed-chunk-9-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This predictor refers to the number of parents and children a passenger has. It seems that this predictor is slightly discriminative if we look closely at the level 0, passengers with no parents or children.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(Embarked,Survived,color=Embarked))+
  geom_point()+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/xgboost/xgboost_files/figure-html/unnamed-chunk-10-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We see that a passenger who embarked from port &lt;strong&gt;S&lt;/strong&gt; is slightly more likely to have died, while the other ports seem to be equally distributed.&lt;/p&gt;
&lt;p&gt;For numeric variables we use the empirical density given the target variable as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata[complete.cases(mydata),], aes(Age,fill=as.factor(Survived)))+
  geom_density(alpha=.5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/xgboost/xgboost_files/figure-html/unnamed-chunk-11-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We see significant overlap between the two conditional distributions, which may indicate that this variable is less relevant.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata, aes(Fare,fill=as.factor(Survived)))+
  geom_density(alpha=.5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/xgboost/xgboost_files/figure-html/unnamed-chunk-12-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;For this variable the conditional distributions are different; we see a spike close to zero reflecting the higher death rate among third class passengers.&lt;/p&gt;
&lt;p&gt;We can also plot two predictors against each other. For instance, let’s try the two predictors Sex and Pclass:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(Sex,Pclass,color=as.factor(Survived)))+
  geom_point(col=&amp;quot;green&amp;quot;,pch=16,cex=7)+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/xgboost/xgboost_files/figure-html/unnamed-chunk-13-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The majority of the females who survived (blue points on the left) came from the first and the second class, while the majority of the males who died (red points on the right) came from the third class.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-partition&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Data partition&lt;/h1&gt;
&lt;p&gt;We take 80% of the data as the training set, and the remaining 20% will serve as the testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
index&amp;lt;-createDataPartition(mydata$Survived,p=0.8,list=FALSE)
train&amp;lt;-mydata[index,]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: The `i` argument of ``[`()` can&amp;#39;t be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test&amp;lt;-mydata[-index,]&lt;/code&gt;&lt;/pre&gt;
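&lt;p&gt;For readers without caret, roughly the same 80/20 split can be sketched with base R; note that createDataPartition also balances the outcome between the two sets, which plain sample() does not:&lt;/p&gt;

```r
# A base-R approximation of the 80/20 split
set.seed(1234)
n   = 891                        # rows in the Titanic training file
idx = sample(n, size = round(0.8 * n))
length(idx)                      # 713 training rows
n - length(idx)                  # 178 testing rows
```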
&lt;p&gt;Now we are ready to impute the missing values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(mice))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;mice&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;imput_train&amp;lt;-mice(train,m=3,seed=111, method = &amp;#39;rf&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: Number of logged events: 30&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train2&amp;lt;-complete(imput_train,1)
summary(train2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From this output we see that we do not have missing values any more.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;model-training&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Model training&lt;/h1&gt;
&lt;p&gt;The xgboost model expects the predictors to be numeric, so we convert the factors to dummy variables with the help of the &lt;strong&gt;Matrix&lt;/strong&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(Matrix))
train_data&amp;lt;-sparse.model.matrix(Survived ~. -1, data=train2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that the -1 added to the formula avoids adding an intercept column of ones to our data. We can take a look at the structure of the data as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;str(train_data)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Formal class &amp;#39;dgCMatrix&amp;#39; [package &amp;quot;Matrix&amp;quot;] with 6 slots
##   ..@ i       : int [1:3570] 1 3 5 8 17 20 23 24 27 28 ...
##   ..@ p       : int [1:21] 0 178 329 713 1173 1886 2062 2086 2100 2114 ...
##   ..@ Dim     : int [1:2] 713 20
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:713] &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; &amp;quot;4&amp;quot; ...
##   .. ..$ : chr [1:20] &amp;quot;Pclass1&amp;quot; &amp;quot;Pclass2&amp;quot; &amp;quot;Pclass3&amp;quot; &amp;quot;Sexmale&amp;quot; ...
##   ..@ x       : num [1:3570] 1 1 1 1 1 1 1 1 1 1 ...
##   ..@ factors : list()&lt;/code&gt;&lt;/pre&gt;
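&lt;p&gt;The effect of the -1 in the formula can be checked on a tiny, made-up data frame with base model.matrix, which follows the same coding rules as sparse.model.matrix:&lt;/p&gt;

```r
# Made-up data frame, for illustration only
df = data.frame(y = c(0, 1, 1), sex = factor(c("male", "female", "female")))

# With an intercept, one factor level is dropped as the reference level;
# with -1, every level of the first factor gets its own dummy column
colnames(model.matrix(y ~ ., data = df))      # "(Intercept)" "sexmale"
colnames(model.matrix(y ~ . - 1, data = df))  # "sexfemale" "sexmale"
```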
&lt;p&gt;We know that many machine learning algorithms require the inputs to be of a specific type. The input types supported by the xgboost algorithm are: matrix, the &lt;strong&gt;dgCMatrix&lt;/strong&gt; class produced by the &lt;strong&gt;Matrix&lt;/strong&gt; package above, or the xgboost class &lt;strong&gt;xgb.DMatrix&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(xgboost))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;xgboost&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We should first store the dependent variable in a separate vector, let’s call it &lt;strong&gt;train_label&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_label&amp;lt;-train$Survived
dim(train_data)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 713  20&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;length(train$Survived)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 713&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we bind the predictors, contained in train_data, with the train_label vector into an &lt;strong&gt;xgb.DMatrix&lt;/strong&gt; object as follows&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_final&amp;lt;-xgb.DMatrix(data = train_data,label=train_label)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To train the model we provide the inputs and specify the argument values whenever we do not want to keep the following defaults:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;objective: for binary classification we use &lt;strong&gt;binary:logistic&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;eta (default=0.3): The learning rate.&lt;/li&gt;
&lt;li&gt;gamma (default=0): also called min_split_loss, the minimum loss required for splitting further a particular node.&lt;/li&gt;
&lt;li&gt;max_depth(default=6): the maximum depth of the tree.&lt;/li&gt;
&lt;li&gt;min_child_weight(default=1): the minimum number of instances required in a node under which the node will be leaf.&lt;/li&gt;
&lt;li&gt;subsample (default=1): with the default the model uses all the data for each tree; if 0.7, for instance, the model randomly samples 70% of the data at each iteration, which helps fight the overfitting problem.&lt;/li&gt;
&lt;li&gt;colsample_bytree (default=1, select all columns): subsample ratio of columns at each iteration.&lt;/li&gt;
&lt;li&gt;nthreads (default=2): number of CPU threads used in parallel processing.&lt;/li&gt;
&lt;li&gt;nrounds : the number of boosting iterations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can check the whole parameters by typing &lt;strong&gt;?xgboost&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;It should be noted that the input data can be fed into the model in two ways:
if the data is of class &lt;strong&gt;xgb.DMatrix&lt;/strong&gt;, containing both the predictors and the label as we did, then we do not use the &lt;strong&gt;label&lt;/strong&gt; argument. Otherwise, with any other class, we provide both the data and label arguments.&lt;/p&gt;
&lt;p&gt;Our first attempt will be made with 40 iterations and the default values for the other arguments.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mymodel &amp;lt;- xgboost(data=train_final, objective = &amp;quot;binary:logistic&amp;quot;,
                   nrounds = 40)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1]  train-error:0.148668 
## [2]  train-error:0.133240 
## [3]  train-error:0.130435 
## [4]  train-error:0.137447 
## [5]  train-error:0.127630 
## [6]  train-error:0.117812 
## [7]  train-error:0.115007 
## [8]  train-error:0.109397 
## [9]  train-error:0.102384 
## [10] train-error:0.103787 
## [11] train-error:0.103787 
## [12] train-error:0.102384 
## [13] train-error:0.100982 
## [14] train-error:0.098177 
## [15] train-error:0.098177 
## [16] train-error:0.096774 
## [17] train-error:0.096774 
## [18] train-error:0.098177 
## [19] train-error:0.093969 
## [20] train-error:0.091164 
## [21] train-error:0.086957 
## [22] train-error:0.085554 
## [23] train-error:0.085554 
## [24] train-error:0.082749 
## [25] train-error:0.082749 
## [26] train-error:0.082749 
## [27] train-error:0.079944 
## [28] train-error:0.075736 
## [29] train-error:0.074334 
## [30] train-error:0.074334 
## [31] train-error:0.072931 
## [32] train-error:0.072931 
## [33] train-error:0.070126 
## [34] train-error:0.070126 
## [35] train-error:0.070126 
## [36] train-error:0.068724 
## [37] train-error:0.067321 
## [38] train-error:0.061711 
## [39] train-error:0.061711 
## [40] train-error:0.063114&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can plot the error rates as follows&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt; mymodel$evaluation_log %&amp;gt;%   
  ggplot(aes(iter, train_error))+
  geom_point()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/xgboost/xgboost_files/figure-html/unnamed-chunk-22-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;To evaluate the model we use the test data, which should go through the same steps as the training data, except for the missing values: since the test set is only used to evaluate the model, we simply remove the rows with missing values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test1 &amp;lt;- test[complete.cases(test),]
test2&amp;lt;-sparse.model.matrix(Survived ~. -1,data=test1)
test_label&amp;lt;-test1$Survived
test_final&amp;lt;-xgb.DMatrix(data = test2, label=test_label)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we use the predict function and the confusionMatrix function from the caret package. Since the predicted values are probabilities, we convert them to predicted classes using a threshold of 0.5 as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(mymodel, test_final)
pred&amp;lt;-ifelse(pred&amp;gt;.5,1,0)
confusionMatrix(as.factor(pred),as.factor(test_label))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 81 13
##          1 11 36
##                                           
##                Accuracy : 0.8298          
##                  95% CI : (0.7574, 0.8878)
##     No Information Rate : 0.6525          
##     P-Value [Acc &amp;gt; NIR] : 2.379e-06       
##                                           
##                   Kappa : 0.6211          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.8383          
##                                           
##             Sensitivity : 0.8804          
##             Specificity : 0.7347          
##          Pos Pred Value : 0.8617          
##          Neg Pred Value : 0.7660          
##              Prevalence : 0.6525          
##          Detection Rate : 0.5745          
##    Detection Prevalence : 0.6667          
##       Balanced Accuracy : 0.8076          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
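&lt;p&gt;The thresholding step itself needs nothing beyond base R; on a hypothetical vector of probabilities, table() gives the raw confusion matrix that confusionMatrix() summarises:&lt;/p&gt;

```r
# Hypothetical predicted probabilities and true labels
prob  = c(0.9, 0.2, 0.6, 0.4, 0.8)
truth = c(1, 0, 1, 1, 1)

pred = ifelse(prob > 0.5, 1, 0)          # apply the 0.5 threshold
table(Predicted = pred, Actual = truth)  # raw confusion matrix
mean(pred == truth)                      # accuracy: 0.8
```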
&lt;p&gt;With the default values we obtain a pretty good accuracy rate. In the next step we fine-tune the hyperparameters using &lt;strong&gt;cross validation&lt;/strong&gt; with the help of the caret package.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;fine-tune-the-hyperparameters&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Fine tune the hyperparameters&lt;/h1&gt;
&lt;p&gt;For the hyperparameters we try different grid values for the above arguments as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;eta: seq(0.2,1,0.2)&lt;/li&gt;
&lt;li&gt;max_depth: seq(2,6,1)&lt;/li&gt;
&lt;li&gt;min_child_weight: c(1,5,10)&lt;/li&gt;
&lt;li&gt;colsample_bytree : seq(0.6,1,0.1)&lt;/li&gt;
&lt;li&gt;nrounds : c(50,200 ,50)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This requires training the model 375 times.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid_tune &amp;lt;- expand.grid(
  nrounds = c(50,200,50),
  max_depth = seq(2,6,1),
  eta = seq(0.2,1,0.2),
  gamma = 0,
  min_child_weight = 1,
  colsample_bytree = seq(0.6,1,0.1),
  subsample = 1
  )&lt;/code&gt;&lt;/pre&gt;
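&lt;p&gt;As a quick sanity check on the figure of 375 models, we can count the rows of this grid; the definition is repeated here so the snippet is self-contained (note that nrounds has three entries, so 3 x 5 x 5 x 5 = 375):&lt;/p&gt;

```r
# Same grid as above; expand.grid crosses every combination of values
grid_tune = expand.grid(
  nrounds          = c(50, 200, 50),
  max_depth        = seq(2, 6, 1),
  eta              = seq(0.2, 1, 0.2),
  gamma            = 0,
  min_child_weight = 1,
  colsample_bytree = seq(0.6, 1, 0.1),
  subsample        = 1
)
nrow(grid_tune)  # 375
```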
&lt;p&gt;Then we use 5-fold cross validation as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;control &amp;lt;- trainControl(
  method = &amp;quot;repeatedcv&amp;quot;,
  number = 5,
  allowParallel = TRUE
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we use the &lt;strong&gt;train&lt;/strong&gt; function from caret to train the model, specifying the method as &lt;strong&gt;xgbTree&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_data1 &amp;lt;- as.matrix(train_data)
train_label1 &amp;lt;- as.factor(train_label)
#mymodel2 &amp;lt;- train(
#  x = train_data1,
#  y = train_label1,
#  trControl = control,
#  tuneGrid = grid_tune,
#  method = &amp;quot;xgbTree&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Training this model took several minutes, and we do not want it to be rerun every time this document is rendered. That is why I have commented out the above script, saved the results in a csv file, and then reloaded them to continue our analysis. If you would like to run this model yourself, just uncomment the script.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# results &amp;lt;- mymodel2$results
# write_csv(results, &amp;quot;xgb_results.csv&amp;quot;)
results &amp;lt;- read_csv(&amp;quot;xgb_results.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Parsed with column specification:
## cols(
##   eta = col_double(),
##   max_depth = col_double(),
##   gamma = col_double(),
##   colsample_bytree = col_double(),
##   min_child_weight = col_double(),
##   subsample = col_double(),
##   nrounds = col_double(),
##   Accuracy = col_double(),
##   Kappa = col_double(),
##   AccuracySD = col_double(),
##   KappaSD = col_double()
## )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s now check the best hyperparameter values:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results %&amp;gt;% 
  arrange(-Accuracy) %&amp;gt;% 
  head(5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 11
##     eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
##   &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;            &amp;lt;dbl&amp;gt;            &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
## 1   0.2         4     0              0.6                1         1      50
## 2   0.2         6     0              0.6                1         1      50
## 3   0.8         2     0              0.8                1         1      50
## 4   0.4         3     0              0.6                1         1      50
## 5   0.2         3     0              1                  1         1     200
## # ... with 4 more variables: Accuracy &amp;lt;dbl&amp;gt;, Kappa &amp;lt;dbl&amp;gt;, AccuracySD &amp;lt;dbl&amp;gt;,
## #   KappaSD &amp;lt;dbl&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, the highest accuracy rate is about 81.34%, with the corresponding hyperparameter values as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results %&amp;gt;% 
  arrange(-Accuracy) %&amp;gt;% 
  head(1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 x 11
##     eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
##   &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;            &amp;lt;dbl&amp;gt;            &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
## 1   0.2         4     0              0.6                1         1      50
## # ... with 4 more variables: Accuracy &amp;lt;dbl&amp;gt;, Kappa &amp;lt;dbl&amp;gt;, AccuracySD &amp;lt;dbl&amp;gt;,
## #   KappaSD &amp;lt;dbl&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we apply these values to fit the final model on the whole data set loaded at the beginning from the train.csv file; then we load the test.csv file of the Titanic data in order to submit our predictions to the kaggle competition.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;imput_mydata&amp;lt;-mice(mydata,m=3,seed=111, method = &amp;#39;rf&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: Number of logged events: 15&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata_imp&amp;lt;-complete(imput_mydata,1)
my_data&amp;lt;-sparse.model.matrix(Survived ~. -1, data = mydata_imp)
mydata_label&amp;lt;-mydata$Survived
data_final&amp;lt;-xgb.DMatrix(data = my_data,label=mydata_label)
final_model &amp;lt;- xgboost(data=data_final, objective = &amp;quot;binary:logistic&amp;quot;,
                   nrounds = 50, max_depth = 4, eta = 0.2, gamma = 0,
                   colsample_bytree = 0.6, min_child_weight = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and we get the following result:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(final_model, data_final)
pred&amp;lt;-ifelse(pred&amp;gt;.5,1,0)
confusionMatrix(as.factor(pred),as.factor(mydata_label))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 518  60
##          1  31 282
##                                          
##                Accuracy : 0.8979         
##                  95% CI : (0.8761, 0.917)
##     No Information Rate : 0.6162         
##     P-Value [Acc &amp;gt; NIR] : &amp;lt; 2.2e-16      
##                                          
##                   Kappa : 0.7806         
##                                          
##  Mcnemar&amp;#39;s Test P-Value : 0.003333       
##                                          
##             Sensitivity : 0.9435         
##             Specificity : 0.8246         
##          Pos Pred Value : 0.8962         
##          Neg Pred Value : 0.9010         
##              Prevalence : 0.6162         
##          Detection Rate : 0.5814         
##    Detection Prevalence : 0.6487         
##       Balanced Accuracy : 0.8840         
##                                          
##        &amp;#39;Positive&amp;#39; Class : 0              
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The accuracy rate with these values is about 90%, though note that this is measured on the training data, so it is likely optimistic.
Now let’s fit this model to the test.csv file.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kag&amp;lt;-read_csv(&amp;quot;../test.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Parsed with column specification:
## cols(
##   PassengerId = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Ticket = col_character(),
##   Fare = col_double(),
##   Cabin = col_character(),
##   Embarked = col_character()
## )&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kag1&amp;lt;-kag[,-c(3,8,10)]
kag1 &amp;lt;- modify_at(kag1,c(&amp;quot;Pclass&amp;quot;, &amp;quot;Sex&amp;quot;, &amp;quot;Embarked&amp;quot;, &amp;quot;SibSp&amp;quot;, &amp;quot;Parch&amp;quot;), as.factor)
summary(kag1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   PassengerId     Pclass      Sex           Age        SibSp       Parch    
##  Min.   : 892.0   1:107   female:152   Min.   : 0.17   0:283   0      :324  
##  1st Qu.: 996.2   2: 93   male  :266   1st Qu.:21.00   1:110   1      : 52  
##  Median :1100.5   3:218                Median :27.00   2: 14   2      : 33  
##  Mean   :1100.5                        Mean   :30.27   3:  4   3      :  3  
##  3rd Qu.:1204.8                        3rd Qu.:39.00   4:  4   4      :  2  
##  Max.   :1309.0                        Max.   :76.00   5:  1   9      :  2  
##                                        NA&amp;#39;s   :86      8:  2   (Other):  2  
##       Fare         Embarked
##  Min.   :  0.000   C:102   
##  1st Qu.:  7.896   Q: 46   
##  Median : 14.454   S:270   
##  Mean   : 35.627           
##  3rd Qu.: 31.500           
##  Max.   :512.329           
##  NA&amp;#39;s   :1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We have 86 missing values for Age and one for Fare. We borrow a good idea from a kaggler named &lt;strong&gt;Harrison Tietze&lt;/strong&gt;, who suggested treating passengers with missing values as likely to have died. For instance, he replaced the missing ages with the mean age of the passengers who died in the training data. We go even further and consider all rows with missing values as passengers who died.&lt;br /&gt;
Additionally, inspecting the summary above, we notice an extra level (9) in the factor &lt;strong&gt;Parch&lt;/strong&gt; that does not exist in the training data, so the model cannot handle this extra information. However, since this level has only two cases, we can approximate it by the closest existing level, which is 6, and then drop level 9 from the factor.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kag1$Parch[kag1$Parch==9]&amp;lt;-6
kag1$Parch &amp;lt;- kag1$Parch %&amp;gt;% forcats::fct_drop()
kag_died &amp;lt;- kag1[!complete.cases(kag1),]
kag2 &amp;lt;- kag1[complete.cases(kag1),]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So we only use the kag2 data for the prediction.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;DP&amp;lt;-sparse.model.matrix(PassengerId~.-1,data=kag2)
head(DP)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 6 x 20 sparse Matrix of class &amp;quot;dgCMatrix&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    [[ suppressing 20 column names &amp;#39;Pclass1&amp;#39;, &amp;#39;Pclass2&amp;#39;, &amp;#39;Pclass3&amp;#39; ... ]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                                                   
## 1 . . 1 1 34.5 . . . . . . . . . . . .  7.8292 1 .
## 2 . . 1 . 47.0 1 . . . . . . . . . . .  7.0000 . 1
## 3 . 1 . 1 62.0 . . . . . . . . . . . .  9.6875 1 .
## 4 . . 1 1 27.0 . . . . . . . . . . . .  8.6625 . 1
## 5 . . 1 . 22.0 1 . . . . . 1 . . . . . 12.2875 . 1
## 6 . . 1 1 14.0 . . . . . . . . . . . .  9.2250 . 1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predkag&amp;lt;-predict(final_model,DP)
head(predkag)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.10634395 0.17170778 0.09650294 0.12390183 0.60250586 0.11714594&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, the output is the predicted probability for each instance, so we should convert these probabilities to class labels:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predkag&amp;lt;-ifelse(predkag&amp;gt;.5,1,0)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First we cbind the PassengerId column with the predicted values, named Survived; then we rbind the result with the incomplete cases stored in kag_died (labeled as not survived):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predkag2K&amp;lt;-cbind(kag2[,1],Survived=predkag)
kag_died$Survived &amp;lt;- 0
predtestk &amp;lt;- rbind(predkag2K,kag_died[, c(1,9)])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, we save the predictions as a csv file to submit to kaggle and then check our rank:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;write_csv(predtestk,&amp;quot;predxgbkag.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Conclusion:&lt;/h1&gt;
&lt;p&gt;Xgboost is among the most powerful machine learning algorithms available nowadays, thanks to its capability to predict a wide range of data from various domains. Several winning solutions in &lt;strong&gt;kaggle&lt;/strong&gt; competitions and elsewhere have been achieved with this model, and it can handle large and complex data with ease. Its large number of hyperparameters gives the modeler many possibilities to tune the model to the data at hand, as well as to fight other problems such as overfitting, and to perform feature selection.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;session-information&#34; class=&#34;section level1&#34; number=&#34;8&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;8&lt;/span&gt; Session information&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## R version 4.0.1 (2020-06-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] xgboost_1.2.0.1 Matrix_1.2-18   mice_3.11.0     caret_6.0-86   
##  [5] lattice_0.20-41 forcats_0.5.0   stringr_1.4.0   dplyr_1.0.2    
##  [9] purrr_0.3.4     readr_1.3.1     tidyr_1.1.2     tibble_3.0.3   
## [13] ggplot2_3.3.2   tidyverse_1.3.0
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-149         fs_1.5.0             lubridate_1.7.9     
##  [4] httr_1.4.2           tools_4.0.1          backports_1.1.10    
##  [7] utf8_1.1.4           R6_2.4.1             rpart_4.1-15        
## [10] DBI_1.1.0            colorspace_1.4-1     nnet_7.3-14         
## [13] withr_2.3.0          tidyselect_1.1.0     compiler_4.0.1      
## [16] cli_2.0.2            rvest_0.3.6          xml2_1.3.2          
## [19] labeling_0.3         bookdown_0.20        scales_1.1.1        
## [22] randomForest_4.6-14  digest_0.6.25        rmarkdown_2.4       
## [25] pkgconfig_2.0.3      htmltools_0.5.0      dbplyr_1.4.4        
## [28] rlang_0.4.7          readxl_1.3.1         rstudioapi_0.11     
## [31] generics_0.0.2       farver_2.0.3         jsonlite_1.7.1      
## [34] ModelMetrics_1.2.2.2 magrittr_1.5         Rcpp_1.0.5          
## [37] munsell_0.5.0        fansi_0.4.1          lifecycle_0.2.0     
## [40] stringi_1.5.3        pROC_1.16.2          yaml_2.2.1          
## [43] MASS_7.3-53          plyr_1.8.6           recipes_0.1.13      
## [46] grid_4.0.1           blob_1.2.1           crayon_1.3.4        
## [49] haven_2.3.1          splines_4.0.1        hms_0.5.3           
## [52] knitr_1.30           pillar_1.4.6         reshape2_1.4.4      
## [55] codetools_0.2-16     stats4_4.0.1         reprex_0.3.0        
## [58] glue_1.4.2           evaluate_0.14        blogdown_0.20       
## [61] data.table_1.13.0    modelr_0.1.8         vctrs_0.3.4         
## [64] foreach_1.5.0        cellranger_1.1.0     gtable_0.3.0        
## [67] assertthat_0.2.1     xfun_0.18            gower_0.2.2         
## [70] prodlim_2019.11.13   broom_0.7.1          e1071_1.7-3         
## [73] class_7.3-17         survival_3.2-7       timeDate_3043.102   
## [76] iterators_1.0.12     lava_1.6.8           ellipsis_0.3.1      
## [79] ipred_0.9-9&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;Kevin P.Murphy 2012&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>logistic regression</title>
      <link>https://modelingwithr.rbind.io/post/logimodel/logimodel/</link>
      <pubDate>Thu, 19 Dec 2019 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/post/logimodel/logimodel/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-partition&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Data partition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#train-the-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Train the model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#prediction-and-confusion-matrix&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Prediction and confusion matrix&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-link-function&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; The link function&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;In this paper we will fit a logistic regression model to the &lt;strong&gt;heart disease&lt;/strong&gt; data &lt;a href=&#34;https://www.kaggle.com/johnsmith88/heart-disease-dataset&#34;&gt;uploaded from kaggle website&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For the data preparation we will follow the same steps as in my previous paper about the &lt;strong&gt;naive bayes model&lt;/strong&gt;; for more details, click 
&lt;a href=&#34;https://github.com/Metalesaek/naive-bayes-model&#34;&gt;here&lt;/a&gt; to access that paper.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;First we load the required packages and read in the data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse, warn.conflicts = FALSE)
library(caret, warn.conflicts = FALSE)
mydata&amp;lt;-read.csv(&amp;quot;heart.csv&amp;quot;,header = TRUE)
names(mydata)[1]&amp;lt;-&amp;quot;age&amp;quot;
glimpse(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 303
## Columns: 14
## $ age      &amp;lt;int&amp;gt; 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      &amp;lt;int&amp;gt; 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp       &amp;lt;int&amp;gt; 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps &amp;lt;int&amp;gt; 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     &amp;lt;int&amp;gt; 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      &amp;lt;int&amp;gt; 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ restecg  &amp;lt;int&amp;gt; 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach  &amp;lt;int&amp;gt; 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    &amp;lt;int&amp;gt; 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak  &amp;lt;dbl&amp;gt; 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    &amp;lt;int&amp;gt; 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca       &amp;lt;int&amp;gt; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal     &amp;lt;int&amp;gt; 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target   &amp;lt;int&amp;gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The data at hand has the following features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;age.&lt;/li&gt;
&lt;li&gt;sex: 1=male,0=female&lt;/li&gt;
&lt;li&gt;cp : chest pain type.&lt;/li&gt;
&lt;li&gt;trestbps : resting blood pressure.&lt;/li&gt;
&lt;li&gt;chol: serum cholestoral.&lt;/li&gt;
&lt;li&gt;fbs : fasting blood sugar.&lt;/li&gt;
&lt;li&gt;restecg : resting electrocardiographic results.&lt;/li&gt;
&lt;li&gt;thalach : maximum heart rate achieved&lt;/li&gt;
&lt;li&gt;exang : exercise induced angina.&lt;/li&gt;
&lt;li&gt;oldpeak : ST depression induced by exercise relative to rest.&lt;/li&gt;
&lt;li&gt;slope : the slope of the peak exercise ST segment.&lt;/li&gt;
&lt;li&gt;ca : number of major vessels colored by flourosopy.&lt;/li&gt;
&lt;li&gt;thal : it is not well defined from the data source.&lt;/li&gt;
&lt;li&gt;target: have heart disease or not.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We see that some features should be converted to factor type as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-mydata %&amp;gt;%
  modify_at(c(2,3,6,7,9,11,12,13,14),as.factor)
glimpse(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 303
## Columns: 14
## $ age      &amp;lt;int&amp;gt; 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      &amp;lt;fct&amp;gt; 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp       &amp;lt;fct&amp;gt; 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps &amp;lt;int&amp;gt; 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     &amp;lt;int&amp;gt; 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      &amp;lt;fct&amp;gt; 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ restecg  &amp;lt;fct&amp;gt; 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach  &amp;lt;int&amp;gt; 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    &amp;lt;fct&amp;gt; 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak  &amp;lt;dbl&amp;gt; 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    &amp;lt;fct&amp;gt; 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca       &amp;lt;fct&amp;gt; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal     &amp;lt;fct&amp;gt; 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target   &amp;lt;fct&amp;gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before going ahead, we should check the relationships between the target variable and the remaining factors:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+sex,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       sex
## target   0   1
##      0  24 114
##      1  72  93&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+cp,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       cp
## target   0   1   2   3
##      0 104   9  18   7
##      1  39  41  69  16&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+fbs,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       fbs
## target   0   1
##      0 116  22
##      1 142  23&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+restecg,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       restecg
## target  0  1  2
##      0 79 56  3
##      1 68 96  1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+exang,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       exang
## target   0   1
##      0  62  76
##      1 142  23&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+slope,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       slope
## target   0   1   2
##      0  12  91  35
##      1   9  49 107&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+ca,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       ca
## target   0   1   2   3   4
##      0  45  44  31  17   1
##      1 130  21   7   3   4&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+thal,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       thal
## target   0   1   2   3
##      0   1  12  36  89
##      1   1   6 130  28&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, the &lt;strong&gt;restecg&lt;/strong&gt;, &lt;strong&gt;ca&lt;/strong&gt; and &lt;strong&gt;thal&lt;/strong&gt; variables have cell counts below the threshold of 5 cases required for logistic regression. In addition, if we split the data into a training set and a test set, level &lt;strong&gt;2&lt;/strong&gt; of the &lt;strong&gt;restecg&lt;/strong&gt; variable could be absent from one of the sets, since it has so few cases. Therefore we should remove these variables from the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-mydata[,-c(7,12,13)]
glimpse(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 303
## Columns: 11
## $ age      &amp;lt;int&amp;gt; 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      &amp;lt;fct&amp;gt; 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp       &amp;lt;fct&amp;gt; 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps &amp;lt;int&amp;gt; 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     &amp;lt;int&amp;gt; 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      &amp;lt;fct&amp;gt; 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ thalach  &amp;lt;int&amp;gt; 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    &amp;lt;fct&amp;gt; 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak  &amp;lt;dbl&amp;gt; 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    &amp;lt;fct&amp;gt; 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ target   &amp;lt;fct&amp;gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before training our model, we can get a rough insight into which predictors are likely to be important for predicting the dependent variable.&lt;/p&gt;
&lt;p&gt;Let’s plot the relationships between the target variable and the other features.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(sex,target,color=target))+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/logimodel/logimodel_files/figure-html/unnamed-chunk-6-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If we look only at the red points (healthy patients), we could wrongly conclude that females are less healthy than males. This interpretation ignores the imbalance between the sex levels (96 females, 207 males). In contrast, if we look only at females, we can say that a particular female is more likely to have the disease than not.&lt;/p&gt;
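&lt;p&gt;To compare the two sexes on a fair footing, we can look at the proportions within each sex level rather than the raw counts; a quick sketch using base R on the same mydata object:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# row-wise proportions: share of each target class within each sex
prop.table(xtabs(~sex + target, data = mydata), margin = 1)&lt;/code&gt;&lt;/pre&gt;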
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(cp, fill = target))+
  geom_bar(stat = &amp;quot;count&amp;quot;, position = &amp;quot;dodge&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/logimodel/logimodel_files/figure-html/unnamed-chunk-7-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From this plot we can conclude that a patient without any chest pain is highly unlikely to have the disease, whereas for any chest pain type the patient is more likely to be affected by it. We can therefore expect this predictor to be of significant importance in the trained model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata, aes(age,fill=target))+
  geom_density(alpha=.5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/logimodel/logimodel_files/figure-html/unnamed-chunk-8-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-partition&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Data partition&lt;/h1&gt;
&lt;p&gt;We take out 80% of the data to use as the training set; the rest is put aside to evaluate the model’s performance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
index&amp;lt;-createDataPartition(mydata$target, p=.8,list=FALSE)
train&amp;lt;-mydata[index,]
test&amp;lt;-mydata[-index,]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;train-the-model&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Train the model&lt;/h1&gt;
&lt;p&gt;We are now ready to train our model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model &amp;lt;- glm(target~., data=train,family = &amp;quot;binomial&amp;quot;)
summary(model)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## glm(formula = target ~ ., family = &amp;quot;binomial&amp;quot;, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5855  -0.5294   0.1990   0.6120   2.4022  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(&amp;gt;|z|)    
## (Intercept)  3.715274   2.883238   1.289 0.197545    
## age         -0.014712   0.023285  -0.632 0.527502    
## sex1        -1.686359   0.479254  -3.519 0.000434 ***
## cp1          1.212919   0.549670   2.207 0.027340 *  
## cp2          2.010255   0.486638   4.131 3.61e-05 ***
## cp3          2.139066   0.682727   3.133 0.001730 ** 
## trestbps    -0.020471   0.012195  -1.679 0.093220 .  
## chol        -0.005840   0.003776  -1.547 0.121959    
## fbs1        -0.200690   0.519116  -0.387 0.699053    
## thalach      0.024461   0.010928   2.238 0.025196 *  
## exang1      -0.792717   0.431434  -1.837 0.066151 .  
## oldpeak     -0.820508   0.231100  -3.550 0.000385 ***
## slope1      -0.999768   1.015514  -0.984 0.324872    
## slope2      -0.767247   1.097448  -0.699 0.484477    
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 191.33  on 229  degrees of freedom
## AIC: 219.33
## 
## Number of Fisher Scoring iterations: 5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see from the p-values that some variables are not significant, such as &lt;strong&gt;age&lt;/strong&gt;, &lt;strong&gt;chol&lt;/strong&gt;, &lt;strong&gt;fbs&lt;/strong&gt; and &lt;strong&gt;slope&lt;/strong&gt;, as well as the intercept. First, let’s remove the insignificant factor variables &lt;strong&gt;fbs&lt;/strong&gt; and &lt;strong&gt;slope&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model &amp;lt;- glm(target~.-fbs-slope, data=train,family = &amp;quot;binomial&amp;quot;)
summary(model)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## glm(formula = target ~ . - fbs - slope, family = &amp;quot;binomial&amp;quot;, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6702  -0.5505   0.1993   0.6344   2.4495  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(&amp;gt;|z|)    
## (Intercept)  2.826395   2.695175   1.049 0.294322    
## age         -0.016677   0.023157  -0.720 0.471420    
## sex1        -1.729320   0.470656  -3.674 0.000239 ***
## cp1          1.243879   0.548288   2.269 0.023289 *  
## cp2          1.987151   0.472994   4.201 2.65e-05 ***
## cp3          2.125766   0.677257   3.139 0.001696 ** 
## trestbps    -0.020672   0.012005  -1.722 0.085084 .  
## chol        -0.006434   0.003721  -1.729 0.083816 .  
## thalach      0.026567   0.010432   2.547 0.010873 *  
## exang1      -0.848162   0.423189  -2.004 0.045047 *  
## oldpeak     -0.798699   0.198597  -4.022 5.78e-05 ***
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 192.66  on 232  degrees of freedom
## AIC: 214.66
## 
## Number of Fisher Scoring iterations: 5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we remove the &lt;strong&gt;age&lt;/strong&gt; variable since it is the least significant.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model &amp;lt;- glm(target~.-fbs-slope-age, data=train,family = &amp;quot;binomial&amp;quot;)
summary(model)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## glm(formula = target ~ . - fbs - slope - age, family = &amp;quot;binomial&amp;quot;, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6925  -0.5397   0.2032   0.6345   2.4032  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(&amp;gt;|z|)    
## (Intercept)  1.703126   2.188741   0.778 0.436492    
## sex1        -1.677986   0.463447  -3.621 0.000294 ***
## cp1          1.221925   0.545175   2.241 0.025004 *  
## cp2          1.961200   0.468443   4.187 2.83e-05 ***
## cp3          2.085409   0.676469   3.083 0.002051 ** 
## trestbps    -0.022133   0.011872  -1.864 0.062273 .  
## chol        -0.006900   0.003675  -1.878 0.060443 .  
## thalach      0.029761   0.009471   3.142 0.001676 ** 
## exang1      -0.820113   0.420434  -1.951 0.051101 .  
## oldpeak     -0.803423   0.198400  -4.050 5.13e-05 ***
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 193.19  on 233  degrees of freedom
## AIC: 213.19
## 
## Number of Fisher Scoring iterations: 5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we remove the &lt;strong&gt;exang&lt;/strong&gt; variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model &amp;lt;- glm(target~.-fbs-slope-age-exang, data=train,family = &amp;quot;binomial&amp;quot;)
summary(model)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## glm(formula = target ~ . - fbs - slope - age - exang, family = &amp;quot;binomial&amp;quot;, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7030  -0.5643   0.2004   0.6510   2.5728  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(&amp;gt;|z|)    
## (Intercept)  0.832691   2.105139   0.396 0.692436    
## sex1        -1.713577   0.459659  -3.728 0.000193 ***
## cp1          1.494091   0.528172   2.829 0.004672 ** 
## cp2          2.205121   0.454341   4.853 1.21e-06 ***
## cp3          2.220423   0.668760   3.320 0.000899 ***
## trestbps    -0.021812   0.011704  -1.864 0.062375 .  
## chol        -0.007110   0.003597  -1.977 0.048054 *  
## thalach      0.033412   0.009291   3.596 0.000323 ***
## oldpeak     -0.822277   0.195993  -4.195 2.72e-05 ***
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 196.98  on 234  degrees of freedom
## AIC: 214.98
## 
## Number of Fisher Scoring iterations: 5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that we cannot remove the intercept, even though it is not significant, because it absorbs the reference level “0” of the factor &lt;strong&gt;cp&lt;/strong&gt;, which is significant. This is hence our final model.&lt;/p&gt;
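&lt;p&gt;As a side note, if we preferred a different reference level for &lt;strong&gt;cp&lt;/strong&gt; (purely illustrative, not required for our final model), we could relevel the factor before refitting:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# illustrative only: make level 2 the reference level of cp, then refit
train$cp &amp;lt;- relevel(train$cp, ref = &amp;quot;2&amp;quot;)
model2 &amp;lt;- glm(target ~ . - fbs - slope - age - exang, data = train, family = &amp;quot;binomial&amp;quot;)&lt;/code&gt;&lt;/pre&gt;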
&lt;/div&gt;
&lt;div id=&#34;prediction-and-confusion-matrix&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Prediction and confusion matrix&lt;/h1&gt;
&lt;p&gt;We will use this model to make predictions on the training set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(model,train, type=&amp;quot;response&amp;quot;)
head(pred)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         2         3         4         6         7         8 
## 0.5202639 0.9331630 0.8330192 0.3354247 0.7730621 0.8705651&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using the confusion matrix, we get the accuracy rate on the training set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- as.integer(pred&amp;gt;0.5)
confusionMatrix(as.factor(pred),train$target, positive = &amp;quot;1&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  87  17
##          1  24 115
##                                           
##                Accuracy : 0.8313          
##                  95% CI : (0.7781, 0.8761)
##     No Information Rate : 0.5432          
##     P-Value [Acc &amp;gt; NIR] : &amp;lt;2e-16          
##                                           
##                   Kappa : 0.6583          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.3487          
##                                           
##             Sensitivity : 0.8712          
##             Specificity : 0.7838          
##          Pos Pred Value : 0.8273          
##          Neg Pred Value : 0.8365          
##              Prevalence : 0.5432          
##          Detection Rate : 0.4733          
##    Detection Prevalence : 0.5720          
##       Balanced Accuracy : 0.8275          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 1               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On the training set the accuracy rate is about 83.13%, but we are more interested in the accuracy on the test set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(model,test, type=&amp;quot;response&amp;quot;)
pred &amp;lt;- as.integer(pred&amp;gt;0.5)
confusionMatrix(as.factor(pred),test$target)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 16  3
##          1 11 30
##                                           
##                Accuracy : 0.7667          
##                  95% CI : (0.6396, 0.8662)
##     No Information Rate : 0.55            
##     P-Value [Acc &amp;gt; NIR] : 0.0004231       
##                                           
##                   Kappa : 0.5156          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.0613688       
##                                           
##             Sensitivity : 0.5926          
##             Specificity : 0.9091          
##          Pos Pred Value : 0.8421          
##          Neg Pred Value : 0.7317          
##              Prevalence : 0.4500          
##          Detection Rate : 0.2667          
##    Detection Prevalence : 0.3167          
##       Balanced Accuracy : 0.7508          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On the test set we get a lower accuracy rate, about 76.67%.&lt;/p&gt;
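&lt;p&gt;The 0.5 cutoff is only one possible choice; a threshold-free summary such as the ROC curve and its AUC can complement the accuracy rate. A sketch, assuming the &lt;strong&gt;pROC&lt;/strong&gt; package is installed:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(pROC)
prob &amp;lt;- predict(model, test, type=&amp;quot;response&amp;quot;)  # predicted probabilities
roc_obj &amp;lt;- roc(test$target, prob)                    # ROC curve on the test set
auc(roc_obj)                                            # area under the curve
plot(roc_obj)&lt;/code&gt;&lt;/pre&gt;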
&lt;/div&gt;
&lt;div id=&#34;the-link-function&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; The link function&lt;/h1&gt;
&lt;p&gt;By default the link function is the &lt;strong&gt;logit&lt;/strong&gt;, based on the logistic (sigmoid) function; we can, however, use the &lt;strong&gt;probit&lt;/strong&gt; link instead, which is based on the standard normal distribution.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model1 &amp;lt;- glm(target~.-fbs-slope-exang-age, data=train,
             family = binomial(link = &amp;quot;probit&amp;quot;))
summary(model1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## glm(formula = target ~ . - fbs - slope - exang - age, family = binomial(link = &amp;quot;probit&amp;quot;), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7779  -0.5883   0.1666   0.6670   2.5989  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(&amp;gt;|z|)    
## (Intercept)  0.373007   1.199910   0.311 0.755905    
## sex1        -0.940784   0.252631  -3.724 0.000196 ***
## cp1          0.830588   0.299919   2.769 0.005616 ** 
## cp2          1.275100   0.253681   5.026 5.00e-07 ***
## cp3          1.262407   0.387479   3.258 0.001122 ** 
## trestbps    -0.011677   0.006660  -1.753 0.079549 .  
## chol        -0.004068   0.002047  -1.987 0.046870 *  
## thalach      0.018999   0.005163   3.680 0.000233 ***
## oldpeak     -0.470191   0.108935  -4.316 1.59e-05 ***
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 197.23  on 234  degrees of freedom
## AIC: 215.23
## 
## Number of Fisher Scoring iterations: 6&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(model1,test, type=&amp;quot;response&amp;quot;)
pred &amp;lt;- as.integer(pred&amp;gt;0.5)
confusionMatrix(as.factor(pred),test$target)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 16  3
##          1 11 30
##                                           
##                Accuracy : 0.7667          
##                  95% CI : (0.6396, 0.8662)
##     No Information Rate : 0.55            
##     P-Value [Acc &amp;gt; NIR] : 0.0004231       
##                                           
##                   Kappa : 0.5156          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.0613688       
##                                           
##             Sensitivity : 0.5926          
##             Specificity : 0.9091          
##          Pos Pred Value : 0.8421          
##          Neg Pred Value : 0.7317          
##              Prevalence : 0.4500          
##          Detection Rate : 0.2667          
##    Detection Prevalence : 0.3167          
##       Balanced Accuracy : 0.7508          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, we get the same classification results, with only a slight difference in the &lt;strong&gt;AIC&lt;/strong&gt; criterion: &lt;strong&gt;215.23&lt;/strong&gt; for the &lt;strong&gt;probit&lt;/strong&gt; link versus &lt;strong&gt;214.98&lt;/strong&gt; for the &lt;strong&gt;logit&lt;/strong&gt; link.&lt;/p&gt;
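&lt;p&gt;The two fits can also be compared in a single call (a sketch, reusing the model objects above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;AIC(model, model1)  # logit fit vs probit fit, side by side&lt;/code&gt;&lt;/pre&gt;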
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>naive bayes</title>
      <link>https://modelingwithr.rbind.io/post/naivemodel/</link>
      <pubDate>Thu, 19 Dec 2019 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/post/naivemodel/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-partition&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Data partition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-training&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Model training&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-evaluation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Model evaluation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-fine-tuning&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Model fine-tuning:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;The &lt;strong&gt;naive Bayes&lt;/strong&gt; model is based on the strong assumption that the features are &lt;strong&gt;conditionally independent&lt;/strong&gt; given the class label. Since this assumption rarely holds in practice, the model is termed &lt;strong&gt;naive&lt;/strong&gt;. However, even when this assumption is not satisfied, the model still works very well (Kevin P. Murphy, 2012). Using this assumption, we can define the class-conditional density as the product of one-dimensional densities.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(X|y=c,\theta)=\prod_{j=1}^Dp(x_j|y=c,\theta_{jc})\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The possible one dimensional density for each feature depends on the type of the feature:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For real-valued features we can use the Gaussian distribution:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(X|y=c,\theta)=\prod_{j=1}^D\mathcal N(x_j|\mu_{jc},\sigma_{jc}^2)\]&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For binary features we can use the Bernoulli distribution:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(X|y=c,\theta)=\prod_{j=1}^DBer(x_j|\mu_{jc})\]&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For categorical features we can use the multinoulli (categorical) distribution:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(X|y=c,\theta)=\prod_{j=1}^DCat(x_j|\mu_{jc})\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;For data whose features are of mixed types, we can use a product of the above distributions, and this is what we will do in this post.&lt;/p&gt;
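&lt;p&gt;Once these one-dimensional densities are estimated, classification follows from Bayes’ rule: the predicted class maximizes the class prior times the product of the per-feature densities,&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\hat{y}=\underset{c}{\arg\max}\; p(y=c)\prod_{j=1}^Dp(x_j|y=c,\theta_{jc})\]&lt;/span&gt;&lt;/p&gt;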
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;The data that we will use here is &lt;a href=&#34;https://www.kaggle.com/johnsmith88/heart-disease-dataset&#34;&gt;downloaded from the Kaggle website&lt;/a&gt; and concerns heart disease.
Let us start by loading the needed packages and the data; then we give an appropriate name to the first column.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(caret)
mydata&amp;lt;-read.csv(&amp;quot;heart.csv&amp;quot;,header = TRUE)
names(mydata)[1]&amp;lt;-&amp;quot;age&amp;quot;
glimpse(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 303
## Columns: 14
## $ age      &amp;lt;int&amp;gt; 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      &amp;lt;int&amp;gt; 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp       &amp;lt;int&amp;gt; 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps &amp;lt;int&amp;gt; 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     &amp;lt;int&amp;gt; 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      &amp;lt;int&amp;gt; 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ restecg  &amp;lt;int&amp;gt; 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach  &amp;lt;int&amp;gt; 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    &amp;lt;int&amp;gt; 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak  &amp;lt;dbl&amp;gt; 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    &amp;lt;int&amp;gt; 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca       &amp;lt;int&amp;gt; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal     &amp;lt;int&amp;gt; 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target   &amp;lt;int&amp;gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;strong&gt;target&lt;/strong&gt; variable indicates whether a patient has the disease or not, based on the following features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;age.&lt;/li&gt;
&lt;li&gt;sex: 1=male,0=female&lt;/li&gt;
&lt;li&gt;cp : chest pain type.&lt;/li&gt;
&lt;li&gt;trestbps : resting blood pressure.&lt;/li&gt;
&lt;li&gt;chol: serum cholestoral.&lt;/li&gt;
&lt;li&gt;fbs : fasting blood sugar.&lt;/li&gt;
&lt;li&gt;restecg : resting electrocardiographic results.&lt;/li&gt;
&lt;li&gt;thalach : maximum heart rate achieved&lt;/li&gt;
&lt;li&gt;exang : exercise induced angina.&lt;/li&gt;
&lt;li&gt;oldpeak : ST depression induced by exercise relative to rest.&lt;/li&gt;
&lt;li&gt;slope : the slope of the peak exercise ST segment.&lt;/li&gt;
&lt;li&gt;ca : number of major vessels colored by flourosopy.&lt;/li&gt;
&lt;li&gt;thal : it is not well defined from the data source.&lt;/li&gt;
&lt;li&gt;target: have heart disease or not.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The most intuitive way to start our analysis is to get a summary of the data, checking the range, the quartiles, and the presence or absence of missing values for each feature.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       age             sex               cp           trestbps    
##  Min.   :29.00   Min.   :0.0000   Min.   :0.000   Min.   : 94.0  
##  1st Qu.:47.50   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:120.0  
##  Median :55.00   Median :1.0000   Median :1.000   Median :130.0  
##  Mean   :54.37   Mean   :0.6832   Mean   :0.967   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.000   Max.   :200.0  
##       chol            fbs            restecg          thalach     
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.5  
##  Median :240.0   Median :0.0000   Median :1.0000   Median :153.0  
##  Mean   :246.3   Mean   :0.1485   Mean   :0.5281   Mean   :149.6  
##  3rd Qu.:274.5   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##      exang           oldpeak         slope             ca        
##  Min.   :0.0000   Min.   :0.00   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.80   Median :1.000   Median :0.0000  
##  Mean   :0.3267   Mean   :1.04   Mean   :1.399   Mean   :0.7294  
##  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.20   Max.   :2.000   Max.   :4.0000  
##       thal           target      
##  Min.   :0.000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:0.0000  
##  Median :2.000   Median :1.0000  
##  Mean   :2.314   Mean   :0.5446  
##  3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :1.0000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After inspecting the features, we see that some variables should be treated as factors rather than numerics, such as &lt;strong&gt;sex&lt;/strong&gt;, &lt;strong&gt;cp&lt;/strong&gt;, &lt;strong&gt;fbs&lt;/strong&gt;, &lt;strong&gt;restecg&lt;/strong&gt;, &lt;strong&gt;exang&lt;/strong&gt;, &lt;strong&gt;slope&lt;/strong&gt;, &lt;strong&gt;ca&lt;/strong&gt;, &lt;strong&gt;thal&lt;/strong&gt;, and the &lt;strong&gt;target&lt;/strong&gt; variable; hence they will be converted to factor type as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-mydata %&amp;gt;%
  mutate_at(c(2,3,6,7,9,11,12,13,14),funs(as.factor))
summary(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       age        sex     cp         trestbps          chol       fbs    
##  Min.   :29.00   0: 96   0:143   Min.   : 94.0   Min.   :126.0   0:258  
##  1st Qu.:47.50   1:207   1: 50   1st Qu.:120.0   1st Qu.:211.0   1: 45  
##  Median :55.00           2: 87   Median :130.0   Median :240.0          
##  Mean   :54.37           3: 23   Mean   :131.6   Mean   :246.3          
##  3rd Qu.:61.00                   3rd Qu.:140.0   3rd Qu.:274.5          
##  Max.   :77.00                   Max.   :200.0   Max.   :564.0          
##  restecg    thalach      exang      oldpeak     slope   ca      thal    target 
##  0:147   Min.   : 71.0   0:204   Min.   :0.00   0: 21   0:175   0:  2   0:138  
##  1:152   1st Qu.:133.5   1: 99   1st Qu.:0.00   1:140   1: 65   1: 18   1:165  
##  2:  4   Median :153.0           Median :0.80   2:142   2: 38   2:166          
##          Mean   :149.6           Mean   :1.04           3: 20   3:117          
##          3rd Qu.:166.0           3rd Qu.:1.60           4:  5                  
##          Max.   :202.0           Max.   :6.20&lt;/code&gt;&lt;/pre&gt;
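&lt;p&gt;(Aside: &lt;code&gt;funs()&lt;/code&gt; has since been deprecated in newer releases of &lt;strong&gt;dplyr&lt;/strong&gt;; an equivalent sketch using &lt;code&gt;across()&lt;/code&gt; would be:)&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# the same conversion written with the newer across() helper
mydata &amp;lt;- mydata %&amp;gt;%
  mutate(across(c(2,3,6,7,9,11,12,13,14), as.factor))&lt;/code&gt;&lt;/pre&gt;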
&lt;p&gt;In practice it is very useful to inspect (via traditional statistical tests such as the &lt;strong&gt;chi-squared&lt;/strong&gt; test, or via correlation coefficients) the relationships between the target variable and each potential explanatory variable before building any model. Doing so lets us tell the relevant variables apart from the irrelevant ones, and hence decide which to include in our model.
Another important issue with factors is that, when splitting the data into a training set and a testing set, a factor level can be missing from one set if the number of cases for that level is too small.&lt;br /&gt;
Let’s check whether all the factor levels contribute to each target variable level.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+sex,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       sex
## target   0   1
##      0  24 114
##      1  72  93&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+cp,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       cp
## target   0   1   2   3
##      0 104   9  18   7
##      1  39  41  69  16&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+fbs,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       fbs
## target   0   1
##      0 116  22
##      1 142  23&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+restecg,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       restecg
## target  0  1  2
##      0 79 56  3
##      1 68 96  1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+exang,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       exang
## target   0   1
##      0  62  76
##      1 142  23&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+slope,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       slope
## target   0   1   2
##      0  12  91  35
##      1   9  49 107&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+ca,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       ca
## target   0   1   2   3   4
##      0  45  44  31  17   1
##      1 130  21   7   3   4&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xtabs(~target+thal,data=mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       thal
## target   0   1   2   3
##      0   1  12  36  89
##      1   1   6 130  28&lt;/code&gt;&lt;/pre&gt;
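&lt;p&gt;Each of these contingency tables can also be tested formally with the chi-squared test mentioned earlier; a sketch for one of them:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# test the association between sex and target;
# on sparse tables such as restecg, R warns that the
# chi-squared approximation may be inaccurate
chisq.test(xtabs(~target+sex, data=mydata))&lt;/code&gt;&lt;/pre&gt;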
&lt;p&gt;As we see, the &lt;strong&gt;restecg&lt;/strong&gt;, &lt;strong&gt;ca&lt;/strong&gt; and &lt;strong&gt;thal&lt;/strong&gt; variables have levels with fewer than the threshold of 5 cases required. For instance, if we split the data into a training set and a test set, level &lt;strong&gt;2&lt;/strong&gt; of the &lt;strong&gt;restecg&lt;/strong&gt; variable risks being absent from one of the sets, since it has so few cases. Therefore we should remove these variables from the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-mydata[,-c(7,12,13)]
glimpse(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 303
## Columns: 11
## $ age      &amp;lt;int&amp;gt; 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      &amp;lt;fct&amp;gt; 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp       &amp;lt;fct&amp;gt; 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps &amp;lt;int&amp;gt; 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     &amp;lt;int&amp;gt; 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      &amp;lt;fct&amp;gt; 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ thalach  &amp;lt;int&amp;gt; 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    &amp;lt;fct&amp;gt; 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak  &amp;lt;dbl&amp;gt; 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    &amp;lt;fct&amp;gt; 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ target   &amp;lt;fct&amp;gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before training our model, we can get a rough insight into which predictors carry some importance for predicting the dependent variable.&lt;/p&gt;
&lt;p&gt;Let’s plot the relationships between the target variable and the other features.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(sex,target,color=target))+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-7-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If we look only at the red points (healthy patients), we could wrongly conclude that females are less healthy than males. This is because we are not taking into account the imbalanced number of cases at each sex level (96 females, 207 males). In contrast, if we look only at females, we can say that a particular female is more likely to have the disease than not.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(cp,fill= target))+
  geom_bar(stat = &amp;quot;count&amp;quot;,position = &amp;quot;dodge&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-8-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From this plot we can conclude that a patient who does not have any chest pain is highly unlikely to have the disease, whereas for any chest pain type the patient is more likely to be affected by it. We can therefore expect this predictor to have a significant importance in the trained model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata, aes(age,fill=target))+
  geom_density(alpha=.5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-9-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Even though there is a large amount of overlap between the two densities, some difference still exists; note also that these densities are estimated from the sample, not from the true distributions. However, we do not worry much about this, since we will evaluate the resulting model on the testing set.&lt;br /&gt;
We can also check the independence assumption with the correlation matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(psych)
pairs.panels(mydata[,-11])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-10-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As we see, all the correlations are less than 50%, so we can go ahead and train our model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-partition&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Data partition&lt;/h1&gt;
&lt;p&gt;We take out 80% of the data to use as the training set, and the rest will be put aside to evaluate the model performance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
index&amp;lt;-createDataPartition(mydata$target, p=.8,list=FALSE)
train&amp;lt;-mydata[index,]
test&amp;lt;-mydata[-index,]&lt;/code&gt;&lt;/pre&gt;
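&lt;p&gt;Since &lt;code&gt;createDataPartition&lt;/code&gt; samples within each class, we can verify that the class proportions are preserved in both sets (a sketch):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# the class proportions should be close in the two sets
prop.table(table(train$target))
prop.table(table(test$target))&lt;/code&gt;&lt;/pre&gt;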
&lt;/div&gt;
&lt;div id=&#34;model-training&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Model training&lt;/h1&gt;
&lt;p&gt;Note: for this model we do not need to set a seed, because the model uses known densities for the predictors and does not involve any random method.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(naivebayes)
modelnv&amp;lt;-naive_bayes(target~.,data=train)
modelnv&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## ================================== Naive Bayes ================================== 
##  
##  Call: 
## naive_bayes.formula(formula = target ~ ., data = train)
## 
## --------------------------------------------------------------------------------- 
##  
## Laplace smoothing: 0
## 
## --------------------------------------------------------------------------------- 
##  
##  A priori probabilities: 
## 
##         0         1 
## 0.4567901 0.5432099 
## 
## --------------------------------------------------------------------------------- 
##  
##  Tables: 
## 
## --------------------------------------------------------------------------------- 
##  ::: age (Gaussian) 
## --------------------------------------------------------------------------------- 
##       
## age            0         1
##   mean 56.432432 52.378788
##   sd    8.410623  9.896819
## 
## --------------------------------------------------------------------------------- 
##  ::: sex (Bernoulli) 
## --------------------------------------------------------------------------------- 
##    
## sex         0         1
##   0 0.1891892 0.3939394
##   1 0.8108108 0.6060606
## 
## --------------------------------------------------------------------------------- 
##  ::: cp (Categorical) 
## --------------------------------------------------------------------------------- 
##    
## cp           0          1
##   0 0.75675676 0.22727273
##   1 0.07207207 0.25000000
##   2 0.12612613 0.42424242
##   3 0.04504505 0.09848485
## 
## --------------------------------------------------------------------------------- 
##  ::: trestbps (Gaussian) 
## --------------------------------------------------------------------------------- 
##         
## trestbps         0         1
##     mean 133.82883 128.75758
##     sd    18.26267  15.21857
## 
## --------------------------------------------------------------------------------- 
##  ::: chol (Gaussian) 
## --------------------------------------------------------------------------------- 
##       
## chol           0         1
##   mean 248.52252 240.80303
##   sd    51.07194  53.55705
## 
## ---------------------------------------------------------------------------------
## 
## # ... and 5 more tables
## 
## ---------------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, each predictor is treated according to its type: a Gaussian distribution for numeric variables, a Bernoulli distribution for binary variables, and a multinoulli distribution for categorical variables.&lt;/p&gt;
&lt;p&gt;All the information about this model can be extracted using the function &lt;strong&gt;attributes&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;attributes(modelnv)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## $names
## [1] &amp;quot;data&amp;quot;       &amp;quot;levels&amp;quot;     &amp;quot;laplace&amp;quot;    &amp;quot;tables&amp;quot;     &amp;quot;prior&amp;quot;     
## [6] &amp;quot;usekernel&amp;quot;  &amp;quot;usepoisson&amp;quot; &amp;quot;call&amp;quot;      
## 
## $class
## [1] &amp;quot;naive_bayes&amp;quot;&lt;/code&gt;&lt;/pre&gt;
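&lt;p&gt;Individual conditional distributions can be pulled out of the &lt;code&gt;tables&lt;/code&gt; element; for instance (a sketch):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modelnv$tables$age  # Gaussian mean and sd of age within each class&lt;/code&gt;&lt;/pre&gt;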
&lt;p&gt;We can visualize the above results with the function &lt;strong&gt;plot&lt;/strong&gt;, which displays the distribution of each feature: densities for numeric features and bars for factors.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(modelnv)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-1.svg&#34; width=&#34;576&#34; /&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-2.svg&#34; width=&#34;576&#34; /&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-3.svg&#34; width=&#34;576&#34; /&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-4.svg&#34; width=&#34;576&#34; /&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-5.svg&#34; width=&#34;576&#34; /&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-6.svg&#34; width=&#34;576&#34; /&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-7.svg&#34; width=&#34;576&#34; /&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-8.svg&#34; width=&#34;576&#34; /&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-9.svg&#34; width=&#34;576&#34; /&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/naivemodel_files/figure-html/unnamed-chunk-14-10.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;model-evaluation&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Model evaluation&lt;/h1&gt;
&lt;p&gt;We can check the accuracy of this model on the training data using the confusion matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-predict(modelnv,train)
confusionMatrix(pred,train$target)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  86  24
##          1  25 108
##                                           
##                Accuracy : 0.7984          
##                  95% CI : (0.7423, 0.8469)
##     No Information Rate : 0.5432          
##     P-Value [Acc &amp;gt; NIR] : &amp;lt;2e-16          
##                                           
##                   Kappa : 0.5934          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 1               
##                                           
##             Sensitivity : 0.7748          
##             Specificity : 0.8182          
##          Pos Pred Value : 0.7818          
##          Neg Pred Value : 0.8120          
##              Prevalence : 0.4568          
##          Detection Rate : 0.3539          
##    Detection Prevalence : 0.4527          
##       Balanced Accuracy : 0.7965          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The accuracy rate on the training set is about 79.84%.
As expected, the specificity rate (81.82%) for class 1 is larger than the sensitivity rate (77.48%) for class 0, which reflects the fact that the data contain more class 1 than class 0 observations.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;print(prop.table(table(train$target)),digits = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##    0    1 
## 0.46 0.54&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The reliable evaluation is that based on the unseen testing data rather than the training data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-predict(modelnv,test)
confusionMatrix(pred,test$target)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 18  6
##          1  9 27
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.6214, 0.8528)
##     No Information Rate : 0.55            
##     P-Value [Acc &amp;gt; NIR] : 0.001116        
##                                           
##                   Kappa : 0.4898          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.605577        
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.8182          
##          Pos Pred Value : 0.7500          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3000          
##    Detection Prevalence : 0.4000          
##       Balanced Accuracy : 0.7424          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The accuracy rate on the test set is now about 75%, which may be due to an overfitting problem, or to this kind of model not being well suited to this data.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;model-fine-tuning&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Model fine-tuning&lt;/h1&gt;
&lt;p&gt;In order to increase the model performance we can try another set of hyperparameters. The naive Bayes model has different kernels; by default the usekernel argument is set to &lt;strong&gt;FALSE&lt;/strong&gt;, which uses the Gaussian distribution for the numeric variables, while if it is &lt;strong&gt;TRUE&lt;/strong&gt; kernel density estimation is applied instead. Let’s set it to &lt;strong&gt;TRUE&lt;/strong&gt; and see what happens to the test accuracy rate.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modelnv1&amp;lt;-naive_bayes(target~.,data=train,
                      usekernel = TRUE)
pred&amp;lt;-predict(modelnv1,test)
confusionMatrix(pred,test$target)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 19  6
##          1  8 27
##                                           
##                Accuracy : 0.7667          
##                  95% CI : (0.6396, 0.8662)
##     No Information Rate : 0.55            
##     P-Value [Acc &amp;gt; NIR] : 0.0004231       
##                                           
##                   Kappa : 0.5254          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.7892680       
##                                           
##             Sensitivity : 0.7037          
##             Specificity : 0.8182          
##          Pos Pred Value : 0.7600          
##          Neg Pred Value : 0.7714          
##              Prevalence : 0.4500          
##          Detection Rate : 0.3167          
##    Detection Prevalence : 0.4167          
##       Balanced Accuracy : 0.7609          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After using the kernel estimation we obtain a slight improvement in the accuracy rate, which is now about 76.67%.&lt;/p&gt;
&lt;p&gt;Another way to improve the model is to preprocess the data, especially the numeric variables, by centering and scaling them.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modelnv2&amp;lt;-train(target~., data=train,
                method=&amp;quot;naive_bayes&amp;quot;,
                preProc=c(&amp;quot;center&amp;quot;,&amp;quot;scale&amp;quot;))
modelnv2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Naive Bayes 
## 
## 243 samples
##  10 predictor
##   2 classes: &amp;#39;0&amp;#39;, &amp;#39;1&amp;#39; 
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 243, 243, 243, 243, 243, 243, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.7775205  0.5511328
##    TRUE      0.7490468  0.4988034
## 
## Tuning parameter &amp;#39;laplace&amp;#39; was held constant at a value of 0
## Tuning
##  parameter &amp;#39;adjust&amp;#39; was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = FALSE
##  and adjust = 1.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, we get a better accuracy rate with the Gaussian distribution, 77.75% (when usekernel = FALSE), than with the kernel estimation, 74.90%.&lt;/p&gt;
&lt;p&gt;Let’s use the test set:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-predict(modelnv2,test)
confusionMatrix(pred,test$target)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 19  5
##          1  8 28
##                                          
##                Accuracy : 0.7833         
##                  95% CI : (0.658, 0.8793)
##     No Information Rate : 0.55           
##     P-Value [Acc &amp;gt; NIR] : 0.0001472      
##                                          
##                   Kappa : 0.5578         
##                                          
##  Mcnemar&amp;#39;s Test P-Value : 0.5790997      
##                                          
##             Sensitivity : 0.7037         
##             Specificity : 0.8485         
##          Pos Pred Value : 0.7917         
##          Neg Pred Value : 0.7778         
##              Prevalence : 0.4500         
##          Detection Rate : 0.3167         
##    Detection Prevalence : 0.4000         
##       Balanced Accuracy : 0.7761         
##                                          
##        &amp;#39;Positive&amp;#39; Class : 0              
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We have another slight improvement, with an accuracy rate of &lt;strong&gt;78.33%&lt;/strong&gt;, after scaling the data.&lt;/p&gt;
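&lt;p&gt;As a further step, caret exposes two more tuning parameters for this method besides &lt;code&gt;usekernel&lt;/code&gt;, namely &lt;code&gt;laplace&lt;/code&gt; and &lt;code&gt;adjust&lt;/code&gt; (both held constant in the output above). A possible sketch of a fuller grid search, with illustrative grid values, could look like this:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# sketch: search over the three tuning parameters caret reports
# for method = &amp;quot;naive_bayes&amp;quot; (grid values are illustrative)
grid&amp;lt;-expand.grid(laplace = c(0, 0.5, 1),
                  usekernel = c(FALSE, TRUE),
                  adjust = c(0.5, 1, 2))
modelnv3&amp;lt;-train(target~., data=train,
                method=&amp;quot;naive_bayes&amp;quot;,
                preProc=c(&amp;quot;center&amp;quot;,&amp;quot;scale&amp;quot;),
                tuneGrid=grid)&lt;/code&gt;&lt;/pre&gt;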
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Conclusion&lt;/h1&gt;
&lt;p&gt;The naive Bayes model is among the most widely used classical machine learning models, especially with features that are normally distributed, either originally or after transformation. However, compared to bagged or boosted models such as random forest and xgboost, or to deep learning models, it is considerably less attractive.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>knn model</title>
      <link>https://modelingwithr.rbind.io/post/2015-07-23-r-rmarkdown/</link>
      <pubDate>Mon, 16 Dec 2019 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/post/2015-07-23-r-rmarkdown/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#classification&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Classification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-partition&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Data partition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#train-the-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Train the model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#prediction-and-confusion-matrix&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Prediction and confusion matrix&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#fine-tuning-the-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Fine tuning the model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#comparison-between-knn-and-svm-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Comparison between knn and svm model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#regression&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8&lt;/span&gt; Regression&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;In this paper we will explore the &lt;strong&gt;k nearest neighbors&lt;/strong&gt; model using two data sets: the first is the &lt;strong&gt;Titanic&lt;/strong&gt; data, to which we will fit this model for classification, and the second is the &lt;strong&gt;BostonHousing&lt;/strong&gt; data (from the &lt;strong&gt;mlbench&lt;/strong&gt; package), which will be used to fit a regression model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;classification&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Classification&lt;/h1&gt;
&lt;p&gt;We do not repeat the whole process of data preparation and missing-value imputation. You can click &lt;a href=&#34;https://github.com/Metalesaek/svm-model&#34;&gt;here&lt;/a&gt; to see all the details in my paper about the &lt;strong&gt;support vector machine&lt;/strong&gt; model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-partition&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Data partition&lt;/h1&gt;
&lt;p&gt;All the code for the first steps is grouped in one chunk. Notice that we use the same specified parameter values and seed numbers in order to compare the results of the two models, &lt;strong&gt;svm&lt;/strong&gt; and &lt;strong&gt;knn&lt;/strong&gt;, for &lt;strong&gt;classification&lt;/strong&gt; (using the Titanic data) and for regression (using the BostonHousing data).&lt;/p&gt;
&lt;p&gt;This plot shows how the knn model works. With k = 5 the model chooses the 5 closest points, inside the dashed circle, so the blue point is predicted to be red by majority vote (3 red and 2 black); but with k = 9 the blue point is predicted to be black (5 black and 4 red).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(plotrix)
plot(train$Age[10:40],pch=16,train$Fare[10:40],
     col=train$Survived,ylim = c(0,50))
points(x=32,y=20,col=&amp;quot;blue&amp;quot;,pch=8)
draw.circle(x=32,y=20,nv=1000,radius = 5.5,lty=2)
draw.circle(x=32,y=20,nv=1000,radius = 10)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/2015-07-23-r-rmarkdown_files/figure-html/unnamed-chunk-3-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
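&lt;p&gt;The majority vote described above can be sketched in a few lines of base R (a toy illustration on a generic numeric matrix of training points, not the implementation used by caret):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# toy knn vote: x is the new point, X a numeric matrix of training
# points (one row per observation), y their class labels, k the
# number of neighbors
knn_vote&amp;lt;-function(x, X, y, k){
  d&amp;lt;-sqrt(rowSums((X - matrix(x, nrow(X), ncol(X), byrow = TRUE))^2))
  nearest&amp;lt;-order(d)[1:k]
  names(which.max(table(y[nearest])))  # majority class among the k
}&lt;/code&gt;&lt;/pre&gt;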
&lt;p&gt;The last thing we should do before training the model is to convert the factors to numerics and standardize all the predictors in both sets (train and test); finally, we rename the target variable levels.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train1 &amp;lt;- train %&amp;gt;% mutate_at(c(2,3,8),funs(as.numeric))
test1 &amp;lt;- test %&amp;gt;% mutate_at(c(2,3,8),funs(as.numeric))

processed&amp;lt;-preProcess(train1[,-1],method = c(&amp;quot;center&amp;quot;,&amp;quot;scale&amp;quot;))
train1[,-1]&amp;lt;-predict(processed,train1[,-1])
test1[,-1]&amp;lt;-predict(processed,test1[,-1])

train1$Survived &amp;lt;- fct_recode(train1$Survived,died=&amp;quot;0&amp;quot;,surv=&amp;quot;1&amp;quot;)
test1$Survived &amp;lt;- fct_recode(test1$Survived,died=&amp;quot;0&amp;quot;,surv=&amp;quot;1&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;train-the-model&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Train the model&lt;/h1&gt;
&lt;p&gt;The big advantage of the &lt;strong&gt;k nearest neighbors&lt;/strong&gt; model is that it has a single hyperparameter, which makes the tuning process very fast. Here also we make use of the same seed as we did with the &lt;strong&gt;svm&lt;/strong&gt; model. For the resampling process we stick with the default bootstrap method with 25 resampling iterations.&lt;/p&gt;
&lt;p&gt;Let’s now launch the model and get the summary.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modelknn &amp;lt;- train(Survived~., data=train1,
                method=&amp;quot;knn&amp;quot;,
                tuneGrid=expand.grid(k=1:30))
modelknn&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## k-Nearest Neighbors 
## 
## 714 samples
##   7 predictor
##   2 classes: &amp;#39;died&amp;#39;, &amp;#39;surv&amp;#39; 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 714, 714, 714, 714, 714, 714, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.7717650  0.5165447
##    2  0.7688433  0.5088538
##    3  0.7820906  0.5370428
##    4  0.7881072  0.5487894
##    5  0.8003926  0.5733224
##    6  0.7992870  0.5711806
##    7  0.8046907  0.5827968
##    8  0.8104254  0.5950159
##    9  0.8093172  0.5927121
##   10  0.8098395  0.5937574
##   11  0.8110456  0.5957105
##   12  0.8103966  0.5942937
##   13  0.8100784  0.5939193
##   14  0.8115080  0.5960496
##   15  0.8146848  0.6026109
##   16  0.8125027  0.5979064
##   17  0.8147065  0.6015528
##   18  0.8142485  0.6002677
##   19  0.8146543  0.6003686
##   20  0.8124733  0.5960520
##   21  0.8100367  0.5906732
##   22  0.8102084  0.5893078
##   23  0.8094241  0.5873995
##   24  0.8103509  0.5891549
##   25  0.8106517  0.5895533
##   26  0.8116000  0.5909129
##   27  0.8090177  0.5853052
##   28  0.8102358  0.5882055
##   29  0.8114371  0.5905057
##   30  0.8127604  0.5937279
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 17.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The metric used to select the best parameter value is the &lt;strong&gt;accuracy&lt;/strong&gt; rate, whose best value, about 81.47%, is obtained at k = 17. We can also read these values off the plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(modelknn)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/2015-07-23-r-rmarkdown_files/figure-html/unnamed-chunk-6-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Regarding the contributions of the predictors, the importance measure (scaled from 0 to 100) shows that the most important one is by far &lt;strong&gt;Sex&lt;/strong&gt;, followed by &lt;strong&gt;Fare&lt;/strong&gt; and &lt;strong&gt;Pclass&lt;/strong&gt;, and the least important one is &lt;strong&gt;SibSp&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;varImp(modelknn)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ROC curve variable importance
## 
##          Importance
## Sex         100.000
## Fare         62.476
## Pclass       57.192
## Embarked     17.449
## Parch        17.045
## Age           4.409
## SibSp         0.000&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;prediction-and-confusion-matrix&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Prediction and confusion matrix&lt;/h1&gt;
&lt;p&gt;Let’s now use the test set to evaluate the model performance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-predict(modelknn,test1)
confusionMatrix(as.factor(pred),as.factor(test1$Survived))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction died surv
##       died   99   26
##       surv   10   42
##                                           
##                Accuracy : 0.7966          
##                  95% CI : (0.7297, 0.8533)
##     No Information Rate : 0.6158          
##     P-Value [Acc &amp;gt; NIR] : 1.87e-07        
##                                           
##                   Kappa : 0.5503          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.01242         
##                                           
##             Sensitivity : 0.9083          
##             Specificity : 0.6176          
##          Pos Pred Value : 0.7920          
##          Neg Pred Value : 0.8077          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5593          
##    Detection Prevalence : 0.7062          
##       Balanced Accuracy : 0.7630          
##                                           
##        &amp;#39;Positive&amp;#39; Class : died            
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see that the accuracy has slightly decreased, from 81.47% to 79.66%. The closeness of these rates is a good sign that we are not facing an &lt;strong&gt;overfitting&lt;/strong&gt; problem.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;fine-tuning-the-model&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Fine tuning the model&lt;/h1&gt;
&lt;p&gt;To seek improvements we can change the metric. The best summary function, which gives three important metrics for each resampling iteration (&lt;strong&gt;sensitivity&lt;/strong&gt;, &lt;strong&gt;specificity&lt;/strong&gt; and the area under the &lt;strong&gt;ROC&lt;/strong&gt; curve), is &lt;strong&gt;twoClassSummary&lt;/strong&gt;. We also keep the grid search over the number of neighbors up to 30.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;control &amp;lt;- trainControl(classProbs = TRUE,
                        summaryFunction = twoClassSummary)

set.seed(123)
modelknn1 &amp;lt;- train(Survived~., data=train1,
                method = &amp;quot;knn&amp;quot;,
                trControl = control,
                tuneGrid = expand.grid(k=1:30))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning in train.default(x, y, weights = w, ...): The metric &amp;quot;Accuracy&amp;quot; was not
## in the result set. ROC will be used instead.&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modelknn1&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## k-Nearest Neighbors 
## 
## 714 samples
##   7 predictor
##   2 classes: &amp;#39;died&amp;#39;, &amp;#39;surv&amp;#39; 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 714, 714, 714, 714, 714, 714, ... 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens       Spec     
##    1  0.7637394  0.8092152  0.7114938
##    2  0.7959615  0.8102352  0.7013654
##    3  0.8212495  0.8217986  0.7180595
##    4  0.8351414  0.8302266  0.7201146
##    5  0.8455418  0.8448702  0.7283368
##    6  0.8543141  0.8441066  0.7269378
##    7  0.8564044  0.8477382  0.7350766
##    8  0.8590356  0.8526960  0.7421475
##    9  0.8617600  0.8511745  0.7414201
##   10  0.8611361  0.8512356  0.7424516
##   11  0.8621287  0.8546357  0.7399914
##   12  0.8633050  0.8542288  0.7392237
##   13  0.8647328  0.8526082  0.7407331
##   14  0.8656300  0.8572596  0.7369673
##   15  0.8663956  0.8612937  0.7388392
##   16  0.8657711  0.8595923  0.7359633
##   17  0.8658168  0.8652505  0.7322408
##   18  0.8659659  0.8657088  0.7301132
##   19  0.8667079  0.8685106  0.7261585
##   20  0.8668361  0.8657052  0.7252522
##   21  0.8673051  0.8641660  0.7212182
##   22  0.8672610  0.8701453  0.7118060
##   23  0.8675945  0.8703195  0.7094977
##   24  0.8677684  0.8724153  0.7087639
##   25  0.8681884  0.8733028  0.7080003
##   26  0.8681201  0.8768128  0.7048740
##   27  0.8680570  0.8748635  0.7011357
##   28  0.8685130  0.8745234  0.7047600
##   29  0.8686459  0.8756557  0.7055821
##   30  0.8681316  0.8754088  0.7094507
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 29.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This time we use the &lt;strong&gt;ROC&lt;/strong&gt; to choose the best model, which gives a different value, k = 29, with an &lt;strong&gt;ROC&lt;/strong&gt; of 0.8686.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-predict(modelknn1,test1)
confusionMatrix(pred,test1$Survived)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction died surv
##       died   99   29
##       surv   10   39
##                                           
##                Accuracy : 0.7797          
##                  95% CI : (0.7113, 0.8384)
##     No Information Rate : 0.6158          
##     P-Value [Acc &amp;gt; NIR] : 2.439e-06       
##                                           
##                   Kappa : 0.5085          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.003948        
##                                           
##             Sensitivity : 0.9083          
##             Specificity : 0.5735          
##          Pos Pred Value : 0.7734          
##          Neg Pred Value : 0.7959          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5593          
##    Detection Prevalence : 0.7232          
##       Balanced Accuracy : 0.7409          
##                                           
##        &amp;#39;Positive&amp;#39; Class : died            
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using the &lt;strong&gt;ROC&lt;/strong&gt; metric we get a worse result for the accuracy rate, which has decreased from 79.66% to 77.97%.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;comparison-between-knn-and-svm-model&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Comparison between knn and svm model&lt;/h1&gt;
&lt;p&gt;Now let’s train an svm model with the same resampling method and compare the two.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;control&amp;lt;-trainControl(method=&amp;quot;boot&amp;quot;,number=25,
                      classProbs = TRUE,
                      summaryFunction = twoClassSummary)

modelsvm&amp;lt;-train(Survived~., data=train1,
                method=&amp;quot;svmRadial&amp;quot;,
                trControl=control)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning in train.default(x, y, weights = w, ...): The metric &amp;quot;Accuracy&amp;quot; was not
## in the result set. ROC will be used instead.&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modelsvm&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Support Vector Machines with Radial Basis Function Kernel 
## 
## 714 samples
##   7 predictor
##   2 classes: &amp;#39;died&amp;#39;, &amp;#39;surv&amp;#39; 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 714, 714, 714, 714, 714, 714, ... 
## Resampling results across tuning parameters:
## 
##   C     ROC        Sens       Spec     
##   0.25  0.8703474  0.8735475  0.7602162
##   0.50  0.8706929  0.8858278  0.7456306
##   1.00  0.8655619  0.8941179  0.7327856
## 
## Tuning parameter &amp;#39;sigma&amp;#39; was held constant at a value of 0.2282701
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.2282701 and C = 0.5.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And let’s get the confusion matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-predict(modelsvm,test1)
confusionMatrix(pred,test1$Survived)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction died surv
##       died  101   27
##       surv    8   41
##                                           
##                Accuracy : 0.8023          
##                  95% CI : (0.7359, 0.8582)
##     No Information Rate : 0.6158          
##     P-Value [Acc &amp;gt; NIR] : 7.432e-08       
##                                           
##                   Kappa : 0.5589          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.002346        
##                                           
##             Sensitivity : 0.9266          
##             Specificity : 0.6029          
##          Pos Pred Value : 0.7891          
##          Neg Pred Value : 0.8367          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5706          
##    Detection Prevalence : 0.7232          
##       Balanced Accuracy : 0.7648          
##                                           
##        &amp;#39;Positive&amp;#39; Class : died            
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see that the accuracy of this model, 80.23%, is higher than that of the knn model, 77.97% (the &lt;strong&gt;modelknn1&lt;/strong&gt;).
If we have a large number of models to compare, &lt;strong&gt;caret&lt;/strong&gt; provides a function called &lt;strong&gt;resamples&lt;/strong&gt; to compare models, but the models should have the same trainControl parameter values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;comp&amp;lt;-resamples(list( svm = modelsvm,
                         knn = modelknn1))

summary(comp)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## summary.resamples(object = comp)
## 
## Models: svm, knn 
## Number of resamples: 25 
## 
## ROC 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA&amp;#39;s
## svm 0.8472858 0.8617944 0.8691093 0.8706929 0.8744979 0.9043001    0
## knn 0.8298966 0.8577167 0.8670815 0.8686459 0.8792487 0.9135638    0
## 
## Sens 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA&amp;#39;s
## svm 0.8117647 0.8666667 0.8870056 0.8858278 0.9030303 0.9559748    0
## knn 0.8266667 0.8523490 0.8816568 0.8756557 0.8950617 0.9117647    0
## 
## Spec 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA&amp;#39;s
## svm 0.6774194 0.7096774 0.7428571 0.7456306 0.7714286 0.8425926    0
## knn 0.5865385 0.6741573 0.6989247 0.7055821 0.7252747 0.8191489    0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also plot the models’ metric values together.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dotplot(comp,metric=&amp;quot;ROC&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/2015-07-23-r-rmarkdown_files/figure-html/unnamed-chunk-14-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;regression&#34; class=&#34;section level1&#34; number=&#34;8&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;8&lt;/span&gt; Regression&lt;/h1&gt;
&lt;p&gt;First we call the &lt;strong&gt;BostonHousing&lt;/strong&gt; data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(mlbench)
data(&amp;quot;BostonHousing&amp;quot;)
glimpse(BostonHousing)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 506
## Columns: 14
## $ crim    &amp;lt;dbl&amp;gt; 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.088...
## $ zn      &amp;lt;dbl&amp;gt; 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5...
## $ indus   &amp;lt;dbl&amp;gt; 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87,...
## $ chas    &amp;lt;fct&amp;gt; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nox     &amp;lt;dbl&amp;gt; 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.5...
## $ rm      &amp;lt;dbl&amp;gt; 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.6...
## $ age     &amp;lt;dbl&amp;gt; 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9...
## $ dis     &amp;lt;dbl&amp;gt; 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9...
## $ rad     &amp;lt;dbl&amp;gt; 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4,...
## $ tax     &amp;lt;dbl&amp;gt; 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311,...
## $ ptratio &amp;lt;dbl&amp;gt; 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2,...
## $ b       &amp;lt;dbl&amp;gt; 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396...
## $ lstat   &amp;lt;dbl&amp;gt; 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17...
## $ medv    &amp;lt;dbl&amp;gt; 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9,...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will fit a knn model to this data using the continuous variable &lt;strong&gt;medv&lt;/strong&gt; as the target.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
index&amp;lt;-sample(nrow(BostonHousing),size = floor(0.8*(nrow(BostonHousing))))
train&amp;lt;-BostonHousing[index,]
test&amp;lt;-BostonHousing[-index,]

scaled&amp;lt;-preProcess(train[,-14],method=c(&amp;quot;center&amp;quot;,&amp;quot;scale&amp;quot;))
trainscaled&amp;lt;-predict(scaled,train)
testscaled&amp;lt;-predict(scaled,test)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are ready now to train our model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modelknnR &amp;lt;- train(medv~., data=trainscaled,
                method = &amp;quot;knn&amp;quot;,
                tuneGrid = expand.grid(k=1:60))
modelknnR&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## k-Nearest Neighbors 
## 
## 404 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 404, 404, 404, 404, 404, 404, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    1  4.711959  0.7479439  3.047925
##    2  4.600795  0.7545325  3.010235
##    3  4.554112  0.7583915  3.001404
##    4  4.416511  0.7733563  2.939100
##    5  4.414384  0.7736985  2.953741
##    6  4.405364  0.7758010  2.962082
##    7  4.375360  0.7799181  2.955250
##    8  4.409134  0.7773310  2.975489
##    9  4.427529  0.7770847  2.973016
##   10  4.414577  0.7804842  2.957983
##   11  4.447188  0.7787709  2.968389
##   12  4.475134  0.7767642  2.984709
##   13  4.489486  0.7760909  3.000489
##   14  4.518792  0.7746895  3.026858
##   15  4.554107  0.7717809  3.043645
##   16  4.583672  0.7694136  3.058097
##   17  4.599290  0.7695640  3.067001
##   18  4.632439  0.7671729  3.079895
##   19  4.670589  0.7643210  3.098643
##   20  4.708318  0.7614855  3.118593
##   21  4.736963  0.7596509  3.137784
##   22  4.756688  0.7590899  3.151654
##   23  4.781692  0.7577281  3.166203
##   24  4.813669  0.7554223  3.186575
##   25  4.843954  0.7533415  3.200120
##   26  4.872096  0.7513071  3.224031
##   27  4.896463  0.7502052  3.238489
##   28  4.920242  0.7497138  3.252959
##   29  4.944899  0.7484320  3.269227
##   30  4.966726  0.7479621  3.282756
##   31  4.996149  0.7460973  3.303607
##   32  5.024602  0.7438775  3.321013
##   33  5.055147  0.7420656  3.338457
##   34  5.083713  0.7403972  3.360867
##   35  5.108994  0.7388352  3.373694
##   36  5.132420  0.7372288  3.389177
##   37  5.156841  0.7354463  3.409025
##   38  5.175413  0.7349417  3.422294
##   39  5.196438  0.7340164  3.434986
##   40  5.225990  0.7314822  3.452499
##   41  5.249335  0.7299159  3.467267
##   42  5.275185  0.7281473  3.484101
##   43  5.300558  0.7263045  3.502388
##   44  5.322795  0.7251719  3.519220
##   45  5.349383  0.7232707  3.539266
##   46  5.376209  0.7210830  3.560509
##   47  5.398400  0.7199706  3.580476
##   48  5.424020  0.7180096  3.595497
##   49  5.445069  0.7166620  3.609308
##   50  5.469650  0.7145816  3.625718
##   51  5.492104  0.7127439  3.644329
##   52  5.515714  0.7107894  3.659286
##   53  5.535354  0.7092366  3.672172
##   54  5.562260  0.7063225  3.690854
##   55  5.581394  0.7049997  3.705917
##   56  5.600579  0.7036881  3.720464
##   57  5.623071  0.7018951  3.739874
##   58  5.645828  0.6999889  3.755824
##   59  5.662777  0.6990085  3.771570
##   60  5.682182  0.6976068  3.787733
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The best model is the one with k=7, for which the minimum RMSE is about 4.3754.&lt;/p&gt;
&lt;p&gt;We can also get the importance of the predictors.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(varImp(modelknnR))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/2015-07-23-r-rmarkdown_files/figure-html/unnamed-chunk-18-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Then we compute the predictions and the root mean squared error (&lt;strong&gt;RMSE&lt;/strong&gt;) as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-predict(modelknnR,testscaled)
head(pred)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 24.94286 29.88571 20.67143 20.31429 19.18571 20.28571&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;RMSE(pred,test$medv)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 4.416328&lt;/code&gt;&lt;/pre&gt;
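&lt;p&gt;caret’s &lt;strong&gt;RMSE&lt;/strong&gt; function simply computes the square root of the mean squared difference between predictions and observations; a quick sketch with toy vectors:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# RMSE by hand: sqrt of the mean squared prediction error
rmse &amp;lt;- function(pred, obs) sqrt(mean((pred - obs)^2))
rmse(c(24.9, 29.9, 20.7), c(24, 31, 20))  # about 0.915&lt;/code&gt;&lt;/pre&gt;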
&lt;p&gt;The RMSE on the test set is about &lt;strong&gt;4.4163&lt;/strong&gt;, slightly greater than the resampling estimate from the training set, &lt;strong&gt;4.3754&lt;/strong&gt;.
Finally, we can plot the predicted values against the observed values to get insight into their relationship.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data.frame(predicted=pred,observed=test$medv),aes(predicted,observed))+
  geom_point(col=&amp;quot;blue&amp;quot;)+
  geom_abline(col=&amp;quot;red&amp;quot;)+
  ggtitle(&amp;quot;actual values vs predicted values&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/post/2015-07-23-r-rmarkdown_files/figure-html/unnamed-chunk-20-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Methods for dealing with imbalanced data</title>
      <link>https://modelingwithr.rbind.io/post/methods-to-deal-with-imbalanced-data/</link>
      <pubDate>Wed, 10 Apr 2019 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/post/methods-to-deal-with-imbalanced-data/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-partition&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data partition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#subsampling-the-training-data&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Subsampling the training data&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#upsampling&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.1&lt;/span&gt; Upsampling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#downsampling&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.2&lt;/span&gt; Downsampling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#rose&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.3&lt;/span&gt; ROSE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#smote&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3.4&lt;/span&gt; SMOTE&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#training-logistic-regression-model.&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Training the logistic regression model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#without-subsampling&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.1&lt;/span&gt; without subsampling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#upsampling-the-train-set&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.2&lt;/span&gt; Upsampling the train set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#down-sampling-the-training-set.&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.3&lt;/span&gt; Down sampling the training set.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#subsampline-the-train-set-by-rose-technique&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.4&lt;/span&gt; Subsampling the train set by ROSE technique&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#subsampling-the-train-set-by-smote-technique&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4.5&lt;/span&gt; Subsampling the train set by SMOTE technique&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#deep-learning-model-without-class-weight.&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; deep learning model (without class weight).&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#deep-learning-model-with-class-weights&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5.1&lt;/span&gt; deep learning model with class weights&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;Imbalanced data is a common feature of certain types of data, such as credit card fraud data, where the number of fraudulent cards is usually very small compared to the number of non-fraudulent cards. The problem with imbalanced data is that models such as &lt;strong&gt;knn&lt;/strong&gt; and &lt;strong&gt;svm&lt;/strong&gt; become dominated by the majority class, so they predict the majority class more effectively than the minority class, which in turn results in a high sensitivity rate and a low specificity rate (in binary classification).&lt;/p&gt;
&lt;p&gt;A simple way to reduce the negative impact of this problem is to subsample the data. The most common subsampling methods used in practice are the following.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Upsampling&lt;/strong&gt;: this method increases the size of the minority class by sampling with replacement so that the classes will have the same size.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Downsampling&lt;/strong&gt;: in contrast to the above method, this one decreases the size of the majority class to be the same or closer to the minority class size by just taking out a random sample.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid methods&lt;/strong&gt;: the best-known hybrid methods are &lt;strong&gt;ROSE&lt;/strong&gt; (Random Over-Sampling Examples) and &lt;strong&gt;SMOTE&lt;/strong&gt; (Synthetic Minority Over-sampling Technique); they downsample the majority class and create new artificial points in the minority class. For more detail about the &lt;strong&gt;SMOTE&lt;/strong&gt; method click &lt;a href=&#34;https://journals.sagepub.com/doi/full/10.1177/0272989X14560647&#34;&gt;here&lt;/a&gt;, and for &lt;strong&gt;ROSE&lt;/strong&gt; click &lt;a href=&#34;https://www.rdocumentation.org/packages/ROSE/versions/0.0-3/topics/ROSE&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
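&lt;p&gt;The first two ideas can be sketched in base R (a toy illustration with made-up labels, not the &lt;strong&gt;caret&lt;/strong&gt; implementation used below):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)
y &amp;lt;- factor(c(rep(0, 95), rep(1, 5)))  # imbalanced labels: 95 vs 5
minority &amp;lt;- which(y == 1)
majority &amp;lt;- which(y == 0)
# Upsampling: resample the minority class with replacement up to the majority size
up &amp;lt;- c(majority, sample(minority, length(majority), replace = TRUE))
table(y[up])    # 0: 95, 1: 95
# Downsampling: keep a random subset of the majority class of minority size
down &amp;lt;- c(sample(majority, length(minority)), minority)
table(y[down])  # 0: 5, 1: 5&lt;/code&gt;&lt;/pre&gt;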
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: all the above methods should be applied only to the training set; the testing set must never be touched until the final model evaluation step.&lt;/p&gt;
&lt;p&gt;Some types of models can handle imbalanced data, such as a &lt;strong&gt;deep learning&lt;/strong&gt; model with the &lt;strong&gt;class_weight&lt;/strong&gt; argument, which gives more weight to the minority-class cases. For other models, however, such as &lt;strong&gt;svm&lt;/strong&gt; or &lt;strong&gt;knn&lt;/strong&gt;, we have to make use of one of the above methods before training.&lt;/p&gt;
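&lt;p&gt;The class-weight idea can be sketched with the &lt;strong&gt;weights&lt;/strong&gt; argument of &lt;strong&gt;glm&lt;/strong&gt; on simulated data (a toy illustration, not the deep learning model used later): each minority-class case receives a weight equal to the class ratio, so both classes contribute equally to the likelihood.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(42)
x &amp;lt;- rnorm(200)
y &amp;lt;- rbinom(200, 1, plogis(-2 + x))              # imbalanced: mostly zeros
w &amp;lt;- ifelse(y == 1, sum(y == 0)/sum(y == 1), 1)  # upweight the minority class
# quasibinomial avoids the non-integer-successes warning with prior weights
fit &amp;lt;- glm(y ~ x, family = quasibinomial, weights = w)
coef(fit)&lt;/code&gt;&lt;/pre&gt;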
&lt;p&gt;In this article we will make use of the &lt;strong&gt;creditcard&lt;/strong&gt; data from the kaggle website (click &lt;a href=&#34;https://www.kaggle.com/arvindratan/creditcard#creditcard.csv&#34;&gt;here&lt;/a&gt; to download it), which is highly imbalanced. We will train a &lt;strong&gt;logistic regression&lt;/strong&gt; model on the raw data and on the data transformed by each of the above methods, and compare the results. We will also use a simple deep learning model with and without taking the imbalance problem into account.&lt;/p&gt;
&lt;p&gt;First we call the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;spsm(library(tidyverse))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;ggplot2&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;tibble&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;tidyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;dplyr&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data &amp;lt;- read.csv(&amp;quot;../sparklyr/creditcard.csv&amp;quot;, header = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For privacy purposes, the original features have been replaced by the PCA variables v1 to v28; only the &lt;strong&gt;Time&lt;/strong&gt; and &lt;strong&gt;Amount&lt;/strong&gt; features are left from the original ones.&lt;/p&gt;
&lt;p&gt;Let’s first check the frequency of the &lt;strong&gt;Class&lt;/strong&gt; variable levels (after it has been converted to a factor type).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data$Class &amp;lt;- as.factor(data$Class)
prop.table(table(data$Class))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##           0           1 
## 0.998272514 0.001727486&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, the minority class label “1” makes up only about 0.17% of the total cases.
We also display a summary of the data to take an overall look at all the features and check for missing values or unusual outliers.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(data)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       Time              V1                  V2                  V3          
##  Min.   :     0   Min.   :-56.40751   Min.   :-72.71573   Min.   :-48.3256  
##  1st Qu.: 54202   1st Qu.: -0.92037   1st Qu.: -0.59855   1st Qu.: -0.8904  
##  Median : 84692   Median :  0.01811   Median :  0.06549   Median :  0.1799  
##  Mean   : 94814   Mean   :  0.00000   Mean   :  0.00000   Mean   :  0.0000  
##  3rd Qu.:139321   3rd Qu.:  1.31564   3rd Qu.:  0.80372   3rd Qu.:  1.0272  
##  Max.   :172792   Max.   :  2.45493   Max.   : 22.05773   Max.   :  9.3826  
##        V4                 V5                   V6                 V7          
##  Min.   :-5.68317   Min.   :-113.74331   Min.   :-26.1605   Min.   :-43.5572  
##  1st Qu.:-0.84864   1st Qu.:  -0.69160   1st Qu.: -0.7683   1st Qu.: -0.5541  
##  Median :-0.01985   Median :  -0.05434   Median : -0.2742   Median :  0.0401  
##  Mean   : 0.00000   Mean   :   0.00000   Mean   :  0.0000   Mean   :  0.0000  
##  3rd Qu.: 0.74334   3rd Qu.:   0.61193   3rd Qu.:  0.3986   3rd Qu.:  0.5704  
##  Max.   :16.87534   Max.   :  34.80167   Max.   : 73.3016   Max.   :120.5895  
##        V8                  V9                 V10                 V11          
##  Min.   :-73.21672   Min.   :-13.43407   Min.   :-24.58826   Min.   :-4.79747  
##  1st Qu.: -0.20863   1st Qu.: -0.64310   1st Qu.: -0.53543   1st Qu.:-0.76249  
##  Median :  0.02236   Median : -0.05143   Median : -0.09292   Median :-0.03276  
##  Mean   :  0.00000   Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.00000  
##  3rd Qu.:  0.32735   3rd Qu.:  0.59714   3rd Qu.:  0.45392   3rd Qu.: 0.73959  
##  Max.   : 20.00721   Max.   : 15.59500   Max.   : 23.74514   Max.   :12.01891  
##       V12                V13                V14                V15          
##  Min.   :-18.6837   Min.   :-5.79188   Min.   :-19.2143   Min.   :-4.49894  
##  1st Qu.: -0.4056   1st Qu.:-0.64854   1st Qu.: -0.4256   1st Qu.:-0.58288  
##  Median :  0.1400   Median :-0.01357   Median :  0.0506   Median : 0.04807  
##  Mean   :  0.0000   Mean   : 0.00000   Mean   :  0.0000   Mean   : 0.00000  
##  3rd Qu.:  0.6182   3rd Qu.: 0.66251   3rd Qu.:  0.4931   3rd Qu.: 0.64882  
##  Max.   :  7.8484   Max.   : 7.12688   Max.   : 10.5268   Max.   : 8.87774  
##       V16                 V17                 V18           
##  Min.   :-14.12985   Min.   :-25.16280   Min.   :-9.498746  
##  1st Qu.: -0.46804   1st Qu.: -0.48375   1st Qu.:-0.498850  
##  Median :  0.06641   Median : -0.06568   Median :-0.003636  
##  Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.000000  
##  3rd Qu.:  0.52330   3rd Qu.:  0.39968   3rd Qu.: 0.500807  
##  Max.   : 17.31511   Max.   :  9.25353   Max.   : 5.041069  
##       V19                 V20                 V21           
##  Min.   :-7.213527   Min.   :-54.49772   Min.   :-34.83038  
##  1st Qu.:-0.456299   1st Qu.: -0.21172   1st Qu.: -0.22839  
##  Median : 0.003735   Median : -0.06248   Median : -0.02945  
##  Mean   : 0.000000   Mean   :  0.00000   Mean   :  0.00000  
##  3rd Qu.: 0.458949   3rd Qu.:  0.13304   3rd Qu.:  0.18638  
##  Max.   : 5.591971   Max.   : 39.42090   Max.   : 27.20284  
##       V22                  V23                 V24          
##  Min.   :-10.933144   Min.   :-44.80774   Min.   :-2.83663  
##  1st Qu.: -0.542350   1st Qu.: -0.16185   1st Qu.:-0.35459  
##  Median :  0.006782   Median : -0.01119   Median : 0.04098  
##  Mean   :  0.000000   Mean   :  0.00000   Mean   : 0.00000  
##  3rd Qu.:  0.528554   3rd Qu.:  0.14764   3rd Qu.: 0.43953  
##  Max.   : 10.503090   Max.   : 22.52841   Max.   : 4.58455  
##       V25                 V26                V27            
##  Min.   :-10.29540   Min.   :-2.60455   Min.   :-22.565679  
##  1st Qu.: -0.31715   1st Qu.:-0.32698   1st Qu.: -0.070840  
##  Median :  0.01659   Median :-0.05214   Median :  0.001342  
##  Mean   :  0.00000   Mean   : 0.00000   Mean   :  0.000000  
##  3rd Qu.:  0.35072   3rd Qu.: 0.24095   3rd Qu.:  0.091045  
##  Max.   :  7.51959   Max.   : 3.51735   Max.   : 31.612198  
##       V28                Amount         Class     
##  Min.   :-15.43008   Min.   :    0.00   0:284315  
##  1st Qu.: -0.05296   1st Qu.:    5.60   1:   492  
##  Median :  0.01124   Median :   22.00             
##  Mean   :  0.00000   Mean   :   88.35             
##  3rd Qu.:  0.07828   3rd Qu.:   77.17             
##  Max.   : 33.84781   Max.   :25691.16&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looking at this summary, we do not see any critical issues such as missing values.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-partition&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data partition&lt;/h1&gt;
&lt;p&gt;Before applying any subsampling method, we first split the data into a training set and a testing set, and we subsample only the former.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;spsm(library(caret))
set.seed(1234)
index &amp;lt;- createDataPartition(data$Class, p = 0.8, list = FALSE)
train &amp;lt;- data[index, ]
test &amp;lt;- data[-index, ]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;subsampling-the-training-data&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Subsampling the training data&lt;/h1&gt;
&lt;div id=&#34;upsampling&#34; class=&#34;section level2&#34; number=&#34;3.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.1&lt;/span&gt; Upsampling&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;caret&lt;/strong&gt; package provides a function called &lt;strong&gt;upSample&lt;/strong&gt; to perform upsampling.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(111)
trainup &amp;lt;- upSample(x = train[, -ncol(train)], y = train$Class)
table(trainup$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##      0      1 
## 227452 227452&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, the two classes now have the same size: &lt;strong&gt;227452&lt;/strong&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;downsampling&#34; class=&#34;section level2&#34; number=&#34;3.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.2&lt;/span&gt; Downsampling&lt;/h2&gt;
&lt;p&gt;In the same way, we make use of the caret function &lt;strong&gt;downSample&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(111)
traindown &amp;lt;- downSample(x = train[, -ncol(train)], y = train$Class)
table(traindown$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##   0   1 
## 394 394&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now the size of each class is &lt;strong&gt;394&lt;/strong&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;rose&#34; class=&#34;section level2&#34; number=&#34;3.3&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.3&lt;/span&gt; ROSE&lt;/h2&gt;
&lt;p&gt;To use this technique we have to load the &lt;strong&gt;ROSE&lt;/strong&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;spsm(library(ROSE))
set.seed(111)
trainrose &amp;lt;- ROSE(Class ~ ., data = train)$data
table(trainrose$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##      0      1 
## 113827 114019&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since this technique creates new synthetic data points for the minority class and downsamples the majority class, the sizes are now about &lt;strong&gt;114019&lt;/strong&gt; for the minority class and &lt;strong&gt;113827&lt;/strong&gt; for the majority class.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;smote&#34; class=&#34;section level2&#34; number=&#34;3.4&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;3.4&lt;/span&gt; SMOTE&lt;/h2&gt;
&lt;p&gt;This technique requires the &lt;strong&gt;DMwR&lt;/strong&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;spsm(library(DMwR))
set.seed(111)
trainsmote &amp;lt;- SMOTE(Class ~ ., data = train)
table(trainsmote$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##    0    1 
## 1576 1182&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The majority class now has &lt;strong&gt;1576&lt;/strong&gt; cases and the minority class &lt;strong&gt;1182&lt;/strong&gt;, since SMOTE both downsamples the majority class and adds synthetic minority cases.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;training-logistic-regression-model.&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Training the logistic regression model&lt;/h1&gt;
&lt;p&gt;We are now ready to fit a logit model to the original training set without subsampling, and to each of the subsampled training sets obtained above.&lt;/p&gt;
&lt;div id=&#34;without-subsampling&#34; class=&#34;section level2&#34; number=&#34;4.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.1&lt;/span&gt; without subsampling&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
model &amp;lt;- glm(Class ~ ., data = train, family = &amp;quot;binomial&amp;quot;)
summary(model)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## glm(formula = Class ~ ., family = &amp;quot;binomial&amp;quot;, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.9290  -0.0291  -0.0190  -0.0124   4.6028  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(&amp;gt;|z|)    
## (Intercept) -8.486e+00  2.852e-01 -29.753  &amp;lt; 2e-16 ***
## Time        -2.673e-06  2.528e-06  -1.057  0.29037    
## V1           9.397e-02  4.794e-02   1.960  0.04996 *  
## V2           1.097e-02  6.706e-02   0.164  0.87006    
## V3           1.290e-03  5.949e-02   0.022  0.98270    
## V4           6.851e-01  8.408e-02   8.148 3.69e-16 ***
## V5           1.472e-01  7.301e-02   2.017  0.04372 *  
## V6          -8.450e-02  7.902e-02  -1.069  0.28491    
## V7          -1.098e-01  7.591e-02  -1.446  0.14816    
## V8          -1.718e-01  3.402e-02  -5.050 4.41e-07 ***
## V9          -1.926e-01  1.258e-01  -1.531  0.12579    
## V10         -8.073e-01  1.118e-01  -7.224 5.07e-13 ***
## V11         -3.920e-03  9.131e-02  -0.043  0.96575    
## V12          2.855e-02  9.432e-02   0.303  0.76210    
## V13         -3.064e-01  9.007e-02  -3.401  0.00067 ***
## V14         -5.308e-01  6.816e-02  -7.787 6.86e-15 ***
## V15         -1.285e-01  9.559e-02  -1.344  0.17903    
## V16         -2.164e-01  1.423e-01  -1.520  0.12840    
## V17          2.913e-02  7.729e-02   0.377  0.70624    
## V18         -3.642e-02  1.445e-01  -0.252  0.80095    
## V19          6.064e-02  1.094e-01   0.554  0.57938    
## V20         -4.449e-01  9.737e-02  -4.570 4.89e-06 ***
## V21          3.661e-01  6.709e-02   5.456 4.87e-08 ***
## V22          5.965e-01  1.519e-01   3.927 8.59e-05 ***
## V23         -1.157e-01  6.545e-02  -1.768  0.07706 .  
## V24          8.146e-02  1.625e-01   0.501  0.61622    
## V25          4.325e-02  1.482e-01   0.292  0.77043    
## V26         -2.679e-01  2.226e-01  -1.203  0.22893    
## V27         -7.280e-01  1.542e-01  -4.720 2.36e-06 ***
## V28         -2.817e-01  9.864e-02  -2.856  0.00429 ** 
## Amount       9.154e-04  4.379e-04   2.091  0.03656 *  
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5799.1  on 227845  degrees of freedom
## Residual deviance: 1768.0  on 227815  degrees of freedom
## AIC: 1830
## 
## Number of Fisher Scoring iterations: 12&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At this step, and to make things simpler, we remove the insignificant variables (those without an asterisk) and keep the remaining ones to use in all the following models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
model1 &amp;lt;- glm(Class ~ . - Time - V2 - V3 - V6 - V7 - V9 - V11 - V12 - V15 - V16 - 
    V17 - V18 - V19 - V24 - V25 - V26, data = train, family = &amp;quot;binomial&amp;quot;)
summary(model1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## glm(formula = Class ~ . - Time - V2 - V3 - V6 - V7 - V9 - V11 - 
##     V12 - V15 - V16 - V17 - V18 - V19 - V24 - V25 - V26, family = &amp;quot;binomial&amp;quot;, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6514  -0.0290  -0.0186  -0.0117   4.6192  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(&amp;gt;|z|)    
## (Intercept) -8.763e+00  1.510e-01 -58.023  &amp;lt; 2e-16 ***
## V1           2.108e-02  2.918e-02   0.722 0.470129    
## V4           7.241e-01  6.306e-02  11.483  &amp;lt; 2e-16 ***
## V5           9.934e-02  3.566e-02   2.785 0.005346 ** 
## V8          -1.549e-01  2.178e-02  -7.115 1.12e-12 ***
## V10         -9.290e-01  9.305e-02  -9.985  &amp;lt; 2e-16 ***
## V13         -3.307e-01  8.577e-02  -3.855 0.000116 ***
## V14         -5.229e-01  5.566e-02  -9.396  &amp;lt; 2e-16 ***
## V20         -2.388e-01  6.005e-02  -3.976 7.01e-05 ***
## V21          4.811e-01  5.259e-02   9.148  &amp;lt; 2e-16 ***
## V22          7.675e-01  1.277e-01   6.011 1.84e-09 ***
## V23         -1.522e-01  5.925e-02  -2.569 0.010212 *  
## V27         -6.381e-01  1.295e-01  -4.927 8.34e-07 ***
## V28         -2.485e-01  9.881e-02  -2.515 0.011900 *  
## Amount       2.713e-07  1.290e-04   0.002 0.998323    
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5799.1  on 227845  degrees of freedom
## Residual deviance: 1798.7  on 227831  degrees of freedom
## AIC: 1828.7
## 
## Number of Fisher Scoring iterations: 11&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We now have two predictors that are not significant, &lt;strong&gt;V1&lt;/strong&gt; and &lt;strong&gt;Amount&lt;/strong&gt;, so they should also be removed.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
finalmodel &amp;lt;- glm(Class ~ . - Time - V1 - V2 - V3 - V6 - V7 - V9 - V11 - V12 - V15 - 
    V16 - V17 - V18 - V19 - V24 - V25 - V26 - Amount, data = train, family = &amp;quot;binomial&amp;quot;)
summary(finalmodel)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## glm(formula = Class ~ . - Time - V1 - V2 - V3 - V6 - V7 - V9 - 
##     V11 - V12 - V15 - V16 - V17 - V18 - V19 - V24 - V25 - V26 - 
##     Amount, family = &amp;quot;binomial&amp;quot;, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6285  -0.0289  -0.0186  -0.0117   4.5835  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(&amp;gt;|z|)    
## (Intercept) -8.75058    0.14706 -59.505  &amp;lt; 2e-16 ***
## V4           0.69955    0.05265  13.288  &amp;lt; 2e-16 ***
## V5           0.10650    0.02586   4.119 3.81e-05 ***
## V8          -0.15525    0.01982  -7.833 4.76e-15 ***
## V10         -0.89573    0.07630 -11.740  &amp;lt; 2e-16 ***
## V13         -0.33583    0.08448  -3.975 7.02e-05 ***
## V14         -0.54238    0.04862 -11.155  &amp;lt; 2e-16 ***
## V20         -0.22318    0.04781  -4.668 3.04e-06 ***
## V21          0.47912    0.05205   9.204  &amp;lt; 2e-16 ***
## V22          0.78631    0.12439   6.321 2.60e-10 ***
## V23         -0.15046    0.05498  -2.736  0.00621 ** 
## V27         -0.58832    0.10411  -5.651 1.60e-08 ***
## V28         -0.23592    0.08901  -2.651  0.00804 ** 
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5799.1  on 227845  degrees of freedom
## Residual deviance: 1799.2  on 227833  degrees of freedom
## AIC: 1825.2
## 
## Number of Fisher Scoring iterations: 11&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the other training sets we will use only these significant predictors from the above model.&lt;/p&gt;
&lt;p&gt;Now let’s get the final results from the confusion matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(finalmodel, test, type = &amp;quot;response&amp;quot;)
pred &amp;lt;- as.integer(pred &amp;gt; 0.5)
confusionMatrix(as.factor(pred), test$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 56856    41
##          1     7    57
##                                           
##                Accuracy : 0.9992          
##                  95% CI : (0.9989, 0.9994)
##     No Information Rate : 0.9983          
##     P-Value [Acc &amp;gt; NIR] : 1.581e-08       
##                                           
##                   Kappa : 0.7033          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 1.906e-06       
##                                           
##             Sensitivity : 0.9999          
##             Specificity : 0.5816          
##          Pos Pred Value : 0.9993          
##          Neg Pred Value : 0.8906          
##              Prevalence : 0.9983          
##          Detection Rate : 0.9982          
##    Detection Prevalence : 0.9989          
##       Balanced Accuracy : 0.7908          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, we have a high accuracy rate of about &lt;strong&gt;99.92%&lt;/strong&gt;. However, this rate is almost the same as the no-information rate of &lt;strong&gt;99.83%&lt;/strong&gt; (what we would obtain by predicting every case as class label 0). In other words, this high rate is not due to the quality of the model but rather to the imbalanced classes.
If we look at the specificity rate, it is about &lt;strong&gt;58.16%&lt;/strong&gt;, indicating that the model poorly predicts the fraudulent cards, which are the most important class label to predict correctly.
Among the available metrics, the best one for imbalanced data is 
&lt;a href=&#34;https://towardsdatascience.com/interpretation-of-kappa-values-2acd1ca7b18f&#34;&gt;Cohen’s kappa&lt;/a&gt;; according to the interpretation scale suggested by Landis &amp;amp; Koch (1977), the kappa value obtained here, &lt;strong&gt;0.7033&lt;/strong&gt;, is a good score.&lt;/p&gt;
&lt;p&gt;Here, however, we stick with the accuracy rate for pedagogic purposes, to show the effectiveness of the methods discussed above.&lt;/p&gt;
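&lt;p&gt;The reported kappa can be recomputed by hand from the confusion-matrix counts above, as the observed agreement corrected for the agreement expected by chance; a quick sketch:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tab &amp;lt;- matrix(c(56856, 7, 41, 57), nrow = 2)  # rows = prediction, cols = reference
n &amp;lt;- sum(tab)
po &amp;lt;- sum(diag(tab))/n                      # observed agreement (the accuracy)
pe &amp;lt;- sum(rowSums(tab) * colSums(tab))/n^2  # agreement expected by chance
(po - pe)/(1 - pe)                          # about 0.7033, as reported&lt;/code&gt;&lt;/pre&gt;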
&lt;/div&gt;
&lt;div id=&#34;upsampling-the-train-set&#34; class=&#34;section level2&#34; number=&#34;4.2&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.2&lt;/span&gt; Upsampling the train set&lt;/h2&gt;
&lt;p&gt;Now let’s use the training data resulting from the upsampling method.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modelup &amp;lt;- glm(Class ~ V4 + V5 + V8 + V10 + V13 + V14 + V20 + V21 + V22 + V23 + V27 + 
    V28, data = trainup, family = &amp;quot;binomial&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(modelup)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## glm(formula = Class ~ V4 + V5 + V8 + V10 + V13 + V14 + V20 + 
##     V21 + V22 + V23 + V27 + V28, family = &amp;quot;binomial&amp;quot;, data = trainup)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.2906  -0.2785  -0.0001   0.0159   2.8055  
## 
## Coefficients:
##              Estimate Std. Error  z value Pr(&amp;gt;|z|)    
## (Intercept) -3.271053   0.011741 -278.610  &amp;lt; 2e-16 ***
## V4           0.952941   0.005478  173.966  &amp;lt; 2e-16 ***
## V5           0.126627   0.003976   31.846  &amp;lt; 2e-16 ***
## V8          -0.289448   0.004368  -66.261  &amp;lt; 2e-16 ***
## V10         -0.710629   0.009150  -77.665  &amp;lt; 2e-16 ***
## V13         -0.479344   0.007352  -65.200  &amp;lt; 2e-16 ***
## V14         -0.802941   0.006825 -117.638  &amp;lt; 2e-16 ***
## V20         -0.090453   0.007955  -11.371  &amp;lt; 2e-16 ***
## V21          0.233604   0.007702   30.332  &amp;lt; 2e-16 ***
## V22          0.209203   0.010125   20.662  &amp;lt; 2e-16 ***
## V23         -0.320073   0.005299  -60.399  &amp;lt; 2e-16 ***
## V27         -0.238132   0.017019  -13.992  &amp;lt; 2e-16 ***
## V28         -0.152294   0.019922   -7.644  2.1e-14 ***
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 630631  on 454903  degrees of freedom
## Residual deviance: 136321  on 454891  degrees of freedom
## AIC: 136347
## 
## Number of Fisher Scoring iterations: 9&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(modelup, test, type = &amp;quot;response&amp;quot;)
pred &amp;lt;- as.integer(pred &amp;gt; 0.5)
confusionMatrix(as.factor(pred), test$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 55334    12
##          1  1529    86
##                                           
##                Accuracy : 0.9729          
##                  95% CI : (0.9716, 0.9743)
##     No Information Rate : 0.9983          
##     P-Value [Acc &amp;gt; NIR] : 1               
##                                           
##                   Kappa : 0.0975          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : &amp;lt;2e-16          
##                                           
##             Sensitivity : 0.97311         
##             Specificity : 0.87755         
##          Pos Pred Value : 0.99978         
##          Neg Pred Value : 0.05325         
##              Prevalence : 0.99828         
##          Detection Rate : 0.97144         
##    Detection Prevalence : 0.97165         
##       Balanced Accuracy : 0.92533         
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we have a smaller accuracy rate, &lt;strong&gt;97.29%&lt;/strong&gt;, but a larger specificity rate, &lt;strong&gt;87.75%&lt;/strong&gt;, which increases the power of the model to detect fraudulent cards.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;down-sampling-the-training-set.&#34; class=&#34;section level2&#34; number=&#34;4.3&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.3&lt;/span&gt; Downsampling the training set&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modeldown &amp;lt;- glm(Class ~ V4 + V5 + V8 + V10 + V13 + V14 + V20 + V21 + V22 + V23 + 
    V27 + V28, data = traindown, family = &amp;quot;binomial&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(modeldown, test, type = &amp;quot;response&amp;quot;)
pred &amp;lt;- as.integer(pred &amp;gt; 0.5)
confusionMatrix(as.factor(pred), test$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 54837    12
##          1  2026    86
##                                           
##                Accuracy : 0.9642          
##                  95% CI : (0.9627, 0.9657)
##     No Information Rate : 0.9983          
##     P-Value [Acc &amp;gt; NIR] : 1               
##                                           
##                   Kappa : 0.0748          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : &amp;lt;2e-16          
##                                           
##             Sensitivity : 0.96437         
##             Specificity : 0.87755         
##          Pos Pred Value : 0.99978         
##          Neg Pred Value : 0.04072         
##              Prevalence : 0.99828         
##          Detection Rate : 0.96271         
##    Detection Prevalence : 0.96292         
##       Balanced Accuracy : 0.92096         
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the downsampling method we get approximately the same specificity rate, &lt;strong&gt;87.75%&lt;/strong&gt;, with a slight decrease in the overall accuracy rate, &lt;strong&gt;96.42%&lt;/strong&gt;. The sensitivity rate has also decreased, to &lt;strong&gt;96.43%&lt;/strong&gt;, since we have reduced the size of the majority class by downsampling.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;subsampline-the-train-set-by-rose-technique&#34; class=&#34;section level2&#34; number=&#34;4.4&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.4&lt;/span&gt; Subsampling the train set by the ROSE technique&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modelrose &amp;lt;- glm(Class ~ V4 + V5 + V8 + V10 + V13 + V14 + V20 + V21 + V22 + V23 + 
    V27 + V28, data = trainrose, family = &amp;quot;binomial&amp;quot;)
pred &amp;lt;- predict(modelrose, test, type = &amp;quot;response&amp;quot;)
pred &amp;lt;- as.integer(pred &amp;gt; 0.5)
confusionMatrix(as.factor(pred), test$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 56080    14
##          1   783    84
##                                         
##                Accuracy : 0.986         
##                  95% CI : (0.985, 0.987)
##     No Information Rate : 0.9983        
##     P-Value [Acc &amp;gt; NIR] : 1             
##                                         
##                   Kappa : 0.1715        
##                                         
##  Mcnemar&amp;#39;s Test P-Value : &amp;lt;2e-16        
##                                         
##             Sensitivity : 0.98623       
##             Specificity : 0.85714       
##          Pos Pred Value : 0.99975       
##          Neg Pred Value : 0.09689       
##              Prevalence : 0.99828       
##          Detection Rate : 0.98453       
##    Detection Prevalence : 0.98478       
##       Balanced Accuracy : 0.92169       
##                                         
##        &amp;#39;Positive&amp;#39; Class : 0             
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using this method the specificity rate, &lt;strong&gt;85.71%&lt;/strong&gt;, is slightly smaller than in the previous models, but it is still a large improvement in predicting fraudulent cards compared to the model trained on the original imbalanced data.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;subsampling-the-train-set-by-smote-technique&#34; class=&#34;section level2&#34; number=&#34;4.5&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;4.5&lt;/span&gt; Subsampling the train set by SMOTE technique&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modelsmote &amp;lt;- glm(Class ~ V4 + V5 + V8 + V10 + V13 + V14 + V20 + V21 + V22 + V23 + 
    V27 + V28, data = trainsmote, family = &amp;quot;binomial&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(modelsmote, test, type = &amp;quot;response&amp;quot;)
pred &amp;lt;- as.integer(pred &amp;gt; 0.5)
confusionMatrix(as.factor(pred), test$Class)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 55457    14
##          1  1406    84
##                                           
##                Accuracy : 0.9751          
##                  95% CI : (0.9738, 0.9763)
##     No Information Rate : 0.9983          
##     P-Value [Acc &amp;gt; NIR] : 1               
##                                           
##                   Kappa : 0.1029          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : &amp;lt;2e-16          
##                                           
##             Sensitivity : 0.97527         
##             Specificity : 0.85714         
##          Pos Pred Value : 0.99975         
##          Neg Pred Value : 0.05638         
##              Prevalence : 0.99828         
##          Detection Rate : 0.97360         
##    Detection Prevalence : 0.97384         
##       Balanced Accuracy : 0.91621         
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this method we get the same specificity rate, &lt;strong&gt;85.71%&lt;/strong&gt;, as with the ROSE method.&lt;/p&gt;
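&lt;p&gt;To compare the four resampling approaches at a glance, we can collect the test-set rates reported in the confusion matrices above into a small data frame (the numbers are simply copied from the outputs above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results &amp;lt;- data.frame(
    method = c(&amp;quot;upsampling&amp;quot;, &amp;quot;downsampling&amp;quot;, &amp;quot;ROSE&amp;quot;, &amp;quot;SMOTE&amp;quot;),
    accuracy = c(0.9729, 0.9642, 0.986, 0.9751),
    specificity = c(0.87755, 0.87755, 0.85714, 0.85714))
results&lt;/code&gt;&lt;/pre&gt;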
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;deep-learning-model-without-class-weight.&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Deep learning model (without class weights)&lt;/h1&gt;
&lt;p&gt;Some deep learning software lets us assign a weight to each label of the target variable. Here we will make use of the &lt;a href=&#34;https://keras.rstudio.com&#34;&gt;keras&lt;/a&gt; package. We will first train the model without weighting the data, then we will retrain the same model after assigning a weight to the minority class.&lt;br /&gt;
To train this model we should first convert the data (train and test sets) into numeric matrices and remove the column names (we also convert &lt;strong&gt;Class&lt;/strong&gt; to numeric type). To stay in line with the models above we keep only the same features, but this time it is better to normalize them, since this helps gradient descent converge faster.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;spsm(library(keras))
train1 &amp;lt;- train[, c(&amp;quot;V4&amp;quot;, &amp;quot;V5&amp;quot;, &amp;quot;V8&amp;quot;, &amp;quot;V10&amp;quot;, &amp;quot;V13&amp;quot;, &amp;quot;V14&amp;quot;, &amp;quot;V20&amp;quot;, &amp;quot;V21&amp;quot;, &amp;quot;V22&amp;quot;, &amp;quot;V23&amp;quot;, 
    &amp;quot;V27&amp;quot;, &amp;quot;V28&amp;quot;, &amp;quot;Class&amp;quot;)]
test1 &amp;lt;- test[, c(&amp;quot;V4&amp;quot;, &amp;quot;V5&amp;quot;, &amp;quot;V8&amp;quot;, &amp;quot;V10&amp;quot;, &amp;quot;V13&amp;quot;, &amp;quot;V14&amp;quot;, &amp;quot;V20&amp;quot;, &amp;quot;V21&amp;quot;, &amp;quot;V22&amp;quot;, &amp;quot;V23&amp;quot;, 
    &amp;quot;V27&amp;quot;, &amp;quot;V28&amp;quot;, &amp;quot;Class&amp;quot;)]
train1$Class &amp;lt;- as.numeric(train1$Class)
test1$Class &amp;lt;- as.numeric(test1$Class)
train1[, &amp;quot;Class&amp;quot;] &amp;lt;- train1[, &amp;quot;Class&amp;quot;] - 1
test1[, &amp;quot;Class&amp;quot;] &amp;lt;- test1[, &amp;quot;Class&amp;quot;] - 1
trainx &amp;lt;- train1[, -ncol(train1)]
testx &amp;lt;- test1[, -ncol(test1)]
trained &amp;lt;- as.matrix(trainx)
tested &amp;lt;- as.matrix(testx)
trainy &amp;lt;- train1$Class
testy &amp;lt;- test1$Class
dimnames(trained) &amp;lt;- NULL
dimnames(tested) &amp;lt;- NULL&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we apply one-hot encoding to the target variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trainlabel &amp;lt;- to_categorical(trainy)
testlabel &amp;lt;- to_categorical(testy)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The final step is normalizing the matrices (trained and tested).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trained1 &amp;lt;- normalize(trained)
tested1 &amp;lt;- normalize(tested)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we are ready to create the model with two hidden layers followed by &lt;a href=&#34;https://keras.rstudio.com/reference/index.html#section-dropout-layers&#34;&gt;dropout layers&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modeldeep &amp;lt;- keras_model_sequential()
modeldeep %&amp;gt;% layer_dense(units = 32, activation = &amp;quot;relu&amp;quot;, kernel_initializer = &amp;quot;he_normal&amp;quot;, 
    input_shape = c(12)) %&amp;gt;% layer_dropout(rate = 0.2) %&amp;gt;% layer_dense(units = 64, 
    activation = &amp;quot;relu&amp;quot;, kernel_initializer = &amp;quot;he_normal&amp;quot;) %&amp;gt;% layer_dropout(rate = 0.4) %&amp;gt;% 
    layer_dense(units = 2, activation = &amp;quot;sigmoid&amp;quot;)
summary(modeldeep)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Model: &amp;quot;sequential&amp;quot;
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense (Dense)                       (None, 32)                      416         
## ________________________________________________________________________________
## dropout (Dropout)                   (None, 32)                      0           
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 64)                      2112        
## ________________________________________________________________________________
## dropout_1 (Dropout)                 (None, 64)                      0           
## ________________________________________________________________________________
## dense_2 (Dense)                     (None, 2)                       130         
## ================================================================================
## Total params: 2,658
## Trainable params: 2,658
## Non-trainable params: 0
## ________________________________________________________________________________&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will use the &lt;strong&gt;accuracy&lt;/strong&gt; rate as the metric. The loss function will be &lt;strong&gt;binary crossentropy&lt;/strong&gt;, since we are dealing with a binary classification problem, and for the optimizer we will use the 
&lt;a href=&#34;https://arxiv.org/pdf/1412.6980v8.pdf&#34;&gt;adam&lt;/a&gt; optimizer.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modeldeep %&amp;gt;% compile(loss = &amp;quot;binary_crossentropy&amp;quot;, optimizer = &amp;quot;adam&amp;quot;, metric = &amp;quot;accuracy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;During training, the model will use 10 epochs (the default), a batch size of 5 samples to update the weights, and hold out 20% of the training samples to assess the model.&lt;/p&gt;
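&lt;p&gt;The training call itself is not run when knitting this document; a minimal sketch of what it looks like, with the settings just described, is:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# modeldeep %&amp;gt;% fit(trained1, trainlabel, epochs = 10, batch_size = 5,
#     validation_split = 0.2)
# save_model_hdf5(modeldeep, &amp;quot;modeldeep.h5&amp;quot;)&lt;/code&gt;&lt;/pre&gt;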
&lt;p&gt;You can rerun this model many times until you are satisfied with the results; it is then better to save it and load it again each time you need it, as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modeldeep &amp;lt;- load_model_hdf5(&amp;quot;modeldeep.h5&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All the metric values above are computed during the training process, so they are not very reliable. The more reliable ones are those computed from unseen data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- modeldeep %&amp;gt;% predict_classes(tested1)
confusionMatrix(as.factor(pred), as.factor(testy))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 56858    64
##          1     5    34
##                                           
##                Accuracy : 0.9988          
##                  95% CI : (0.9985, 0.9991)
##     No Information Rate : 0.9983          
##     P-Value [Acc &amp;gt; NIR] : 0.00125         
##                                           
##                   Kappa : 0.4959          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 2.902e-12       
##                                           
##             Sensitivity : 0.9999          
##             Specificity : 0.3469          
##          Pos Pred Value : 0.9989          
##          Neg Pred Value : 0.8718          
##              Prevalence : 0.9983          
##          Detection Rate : 0.9982          
##    Detection Prevalence : 0.9993          
##       Balanced Accuracy : 0.6734          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As with the models above, the specificity rate suffers; at &lt;strong&gt;0.3469&lt;/strong&gt; it is even worse than in the other models, which is again caused by the imbalanced data.&lt;/p&gt;
&lt;div id=&#34;deep-learning-model-with-class-weights&#34; class=&#34;section level2&#34; number=&#34;5.1&#34;&gt;
&lt;h2&gt;&lt;span class=&#34;header-section-number&#34;&gt;5.1&lt;/span&gt; Deep learning model with class weights&lt;/h2&gt;
&lt;p&gt;Now let’s try the previous model, this time taking the class imbalance into account.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modeldeep1 &amp;lt;- keras_model_sequential()
modeldeep1 %&amp;gt;% layer_dense(units = 32, activation = &amp;quot;relu&amp;quot;, kernel_initializer = &amp;quot;he_normal&amp;quot;, 
    input_shape = c(12)) %&amp;gt;% layer_dropout(rate = 0.2) %&amp;gt;% layer_dense(units = 64, 
    activation = &amp;quot;relu&amp;quot;, kernel_initializer = &amp;quot;he_normal&amp;quot;) %&amp;gt;% layer_dropout(rate = 0.4) %&amp;gt;% 
    layer_dense(units = 2, activation = &amp;quot;sigmoid&amp;quot;)
modeldeep1 %&amp;gt;% compile(loss = &amp;quot;binary_crossentropy&amp;quot;, optimizer = &amp;quot;adam&amp;quot;, metric = &amp;quot;accuracy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To define the appropriate weight, we divide the fraction of the majority class by the fraction of the minority class to get how many times the former is larger than the latter.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prop.table(table(data$Class))[1]/prop.table(table(data$Class))[2]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       0 
## 577.876&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we include this value as weight in the &lt;strong&gt;class_weight&lt;/strong&gt; argument.&lt;/p&gt;
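&lt;p&gt;The weighted training call is likewise commented out here; a sketch, assuming the ratio computed above, is:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# modeldeep1 %&amp;gt;% fit(trained1, trainlabel, epochs = 10, batch_size = 5,
#     validation_split = 0.2, class_weight = list(&amp;quot;0&amp;quot; = 1, &amp;quot;1&amp;quot; = 577.876))
# save_model_hdf5(modeldeep1, &amp;quot;modeldeep1.h5&amp;quot;)&lt;/code&gt;&lt;/pre&gt;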
&lt;p&gt;Again, I saved this model before knitting the document. If you want to run the training code yourself, just uncomment it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modeldeep1 &amp;lt;- load_model_hdf5(&amp;quot;modeldeep1.h5&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s get the confusion matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- modeldeep1 %&amp;gt;% predict_classes(tested1)
confusionMatrix(as.factor(pred), as.factor(testy))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 55303    14
##          1  1560    84
##                                          
##                Accuracy : 0.9724         
##                  95% CI : (0.971, 0.9737)
##     No Information Rate : 0.9983         
##     P-Value [Acc &amp;gt; NIR] : 1              
##                                          
##                   Kappa : 0.0935         
##                                          
##  Mcnemar&amp;#39;s Test P-Value : &amp;lt;2e-16         
##                                          
##             Sensitivity : 0.97257        
##             Specificity : 0.85714        
##          Pos Pred Value : 0.99975        
##          Neg Pred Value : 0.05109        
##              Prevalence : 0.99828        
##          Detection Rate : 0.97089        
##    Detection Prevalence : 0.97114        
##       Balanced Accuracy : 0.91485        
##                                          
##        &amp;#39;Positive&amp;#39; Class : 0              
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using this model we get a lower accuracy rate, &lt;strong&gt;0.9724&lt;/strong&gt;, but the specificity rate is higher than in the previous model, so this model predicts the negative class label as well as the positive class label.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Conclusion&lt;/h1&gt;
&lt;p&gt;With imbalanced data, most machine learning models tend to predict the majority class more efficiently than the minority class. To correct this behavior we can use one of the methods discussed above to bring the per-class accuracy rates closer together. Deep learning models, however, can easily handle this problem by specifying class weights.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>An example preprint / working paper</title>
      <link>https://modelingwithr.rbind.io/publication/preprint/</link>
      <pubDate>Sun, 07 Apr 2019 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/publication/preprint/</guid>
      <description>&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click the &lt;em&gt;Slides&lt;/em&gt; button above to demo Academic&amp;rsquo;s Markdown slides feature.
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Supplementary notes can be added here, including 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/writing-markdown-latex/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;code and math&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Slides</title>
      <link>https://modelingwithr.rbind.io/slides/example/</link>
      <pubDate>Tue, 05 Feb 2019 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/slides/example/</guid>
      <description>&lt;h1 id=&#34;create-slides-in-markdown-with-academic&#34;&gt;Create slides in Markdown with Academic&lt;/h1&gt;
&lt;p&gt;
&lt;a href=&#34;https://sourcethemes.com/academic/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Academic&lt;/a&gt; | 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/managing-content/#create-slides&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Documentation&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;features&#34;&gt;Features&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Efficiently write slides in Markdown&lt;/li&gt;
&lt;li&gt;3-in-1: Create, Present, and Publish your slides&lt;/li&gt;
&lt;li&gt;Supports speaker notes&lt;/li&gt;
&lt;li&gt;Mobile friendly slides&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;controls&#34;&gt;Controls&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Next: &lt;code&gt;Right Arrow&lt;/code&gt; or &lt;code&gt;Space&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Previous: &lt;code&gt;Left Arrow&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Start: &lt;code&gt;Home&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Finish: &lt;code&gt;End&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Overview: &lt;code&gt;Esc&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Speaker notes: &lt;code&gt;S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Fullscreen: &lt;code&gt;F&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Zoom: &lt;code&gt;Alt + Click&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;https://github.com/hakimel/reveal.js#pdf-export&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;PDF Export&lt;/a&gt;: &lt;code&gt;E&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;code-highlighting&#34;&gt;Code Highlighting&lt;/h2&gt;
&lt;p&gt;Inline code: &lt;code&gt;variable&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Code block:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;porridge = &amp;quot;blueberry&amp;quot;
if porridge == &amp;quot;blueberry&amp;quot;:
    print(&amp;quot;Eating...&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2 id=&#34;math&#34;&gt;Math&lt;/h2&gt;
&lt;p&gt;In-line math: $x + y = z$&lt;/p&gt;
&lt;p&gt;Block math:&lt;/p&gt;
&lt;p&gt;$$
f\left( x \right) = \frac{{2\left( {x + 4} \right)\left( {x - 4} \right)}}{{\left( {x + 4} \right)\left( {x + 1} \right)}}
$$&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;fragments&#34;&gt;Fragments&lt;/h2&gt;
&lt;p&gt;Make content appear incrementally&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{{% fragment %}} One {{% /fragment %}}
{{% fragment %}} **Two** {{% /fragment %}}
{{% fragment %}} Three {{% /fragment %}}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Press &lt;code&gt;Space&lt;/code&gt; to play!&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
One
&lt;/span&gt;
&lt;span class=&#34;fragment &#34; &gt;
&lt;strong&gt;Two&lt;/strong&gt;
&lt;/span&gt;
&lt;span class=&#34;fragment &#34; &gt;
Three
&lt;/span&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;A fragment can accept two optional parameters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;class&lt;/code&gt;: use a custom style (requires definition in custom CSS)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;weight&lt;/code&gt;: sets the order in which a fragment appears&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;speaker-notes&#34;&gt;Speaker Notes&lt;/h2&gt;
&lt;p&gt;Add speaker notes to your presentation&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-markdown&#34;&gt;{{% speaker_note %}}
- Only the speaker can read these notes
- Press `S` key to view
{{% /speaker_note %}}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Press the &lt;code&gt;S&lt;/code&gt; key to view the speaker notes!&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;Only the speaker can read these notes&lt;/li&gt;
&lt;li&gt;Press &lt;code&gt;S&lt;/code&gt; key to view&lt;/li&gt;
&lt;/ul&gt;

&lt;/aside&gt;
&lt;hr&gt;
&lt;h2 id=&#34;themes&#34;&gt;Themes&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;black: Black background, white text, blue links (default)&lt;/li&gt;
&lt;li&gt;white: White background, black text, blue links&lt;/li&gt;
&lt;li&gt;league: Gray background, white text, blue links&lt;/li&gt;
&lt;li&gt;beige: Beige background, dark text, brown links&lt;/li&gt;
&lt;li&gt;sky: Blue background, thin dark text, blue links&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;ul&gt;
&lt;li&gt;night: Black background, thick white text, orange links&lt;/li&gt;
&lt;li&gt;serif: Cappuccino background, gray text, brown links&lt;/li&gt;
&lt;li&gt;simple: White background, black text, blue links&lt;/li&gt;
&lt;li&gt;solarized: Cream-colored background, dark green text, blue links&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/img/boards.jpg&#34;
  &gt;

&lt;h2 id=&#34;custom-slide&#34;&gt;Custom Slide&lt;/h2&gt;
&lt;p&gt;Customize the slide style and background&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-markdown&#34;&gt;{{&amp;lt; slide background-image=&amp;quot;/img/boards.jpg&amp;quot; &amp;gt;}}
{{&amp;lt; slide background-color=&amp;quot;#0000FF&amp;quot; &amp;gt;}}
{{&amp;lt; slide class=&amp;quot;my-style&amp;quot; &amp;gt;}}
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2 id=&#34;custom-css-example&#34;&gt;Custom CSS Example&lt;/h2&gt;
&lt;p&gt;Let&amp;rsquo;s make headers navy colored.&lt;/p&gt;
&lt;p&gt;Create &lt;code&gt;assets/css/reveal_custom.css&lt;/code&gt; with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-css&#34;&gt;.reveal section h1,
.reveal section h2,
.reveal section h3 {
  color: navy;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h1 id=&#34;questions&#34;&gt;Questions?&lt;/h1&gt;
&lt;p&gt;
&lt;a href=&#34;https://spectrum.chat/academic&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Ask&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;
&lt;a href=&#34;https://sourcethemes.com/academic/docs/managing-content/#create-slides&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Documentation&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Introduction to sparklyr</title>
      <link>https://modelingwithr.rbind.io/sparklyr/sparklyr/</link>
      <pubDate>Wed, 23 Jan 2019 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/sparklyr/sparklyr/</guid>
      <description>
&lt;script src=&#34;https://modelingwithr.rbind.io/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#installing-sparklyr&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Installing sparklyr&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#installing-spark&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Installing spark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#connecting-to-spark&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Connecting to spark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#importing-data&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Importing data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#manipulating-data&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Manipulating data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#disconnecting&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Disconnecting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#saving-data&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8&lt;/span&gt; saving data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#example-of-modeling-in-spark&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;9&lt;/span&gt; Example of modeling in spark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#streaming&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;10&lt;/span&gt; Streaming&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;The programming language R has very powerful tools and functions to do almost everything we want, such as wrangling, visualizing, modeling, etc. However, R, like all classical languages, requires the whole data set to be loaded into memory before doing anything, which is a big disadvantage when we deal with large data sets on a less powerful machine: even a small data manipulation becomes time consuming, and in some cases the data size can exceed the memory size so that R fails even to load the data.&lt;/p&gt;
&lt;p&gt;There are two widely used engines for this type of data, &lt;strong&gt;hadoop&lt;/strong&gt; and &lt;strong&gt;spark&lt;/strong&gt;, which both use a distributed system to partition the data across different storage locations and to distribute computation among different machines (computing clusters), or among different CPUs inside a single machine.&lt;/p&gt;
&lt;p&gt;Spark (2010) is more recent and is recognized to be faster than hadoop. &lt;strong&gt;scala&lt;/strong&gt; is its native language, but it can also support &lt;strong&gt;SQL&lt;/strong&gt; and &lt;strong&gt;java&lt;/strong&gt;. If you know neither spark nor hadoop, the obvious choice is spark. And if you are an R user who does not want to spend time learning the spark languages (scala or sql), the good news is that the &lt;strong&gt;sparklyr&lt;/strong&gt; package (or sparkR) is an R interface for spark, from which you can use most of your R code and functions from packages such as dplyr, etc.&lt;/p&gt;
&lt;p&gt;In this paper we will go step by step through how to use sparklyr, making use of some examples.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;installing-sparklyr&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Installing sparklyr&lt;/h1&gt;
&lt;p&gt;As with any R package, we call the function &lt;strong&gt;install.packages&lt;/strong&gt;
to install sparklyr, but before that make sure you have &lt;strong&gt;java&lt;/strong&gt; installed on your system, since the programming language &lt;strong&gt;scala&lt;/strong&gt; runs on the java virtual machine.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#install.packages(&amp;quot;sparklyr&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;installing-spark&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Installing spark&lt;/h1&gt;
&lt;p&gt;We have deliberately installed sparklyr before spark so that we can use its function &lt;strong&gt;spark_install()&lt;/strong&gt;, which downloads, installs, and configures the latest version of spark in one step.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#spark_install()&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;connecting-to-spark&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Connecting to spark&lt;/h1&gt;
&lt;p&gt;Usually, spark is designed to create clusters from multiple machines, either physical machines or virtual machines (in the cloud). However, it can also create a local cluster on a single machine, making use of its CPU cores, when available, to speed up data processing.&lt;/p&gt;
&lt;p&gt;Wherever the clusters are created (locally or in the cloud), the data processing functions work in the same way; the only difference is how we create and interact with these clusters. That being the case, we can get started with a local cluster to learn the most basic tasks of data science, such as importing, analyzing, and visualizing data, and fitting machine learning models, using spark via sparklyr.&lt;/p&gt;
&lt;p&gt;To connect to spark in local mode we use the function &lt;strong&gt;spark_connect&lt;/strong&gt; as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(sparklyr)
library(tidyverse)
sc&amp;lt;-spark_connect(master = &amp;quot;local&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;importing-data&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Importing data&lt;/h1&gt;
&lt;p&gt;If the data is built into R, we load it into the spark memory using the function &lt;strong&gt;copy_to&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-copy_to(sc,airquality)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then R can access this data with the help of sparklyr; for example, we can use the dplyr function &lt;strong&gt;glimpse&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;glimpse(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: ??
## Columns: 6
## Database: spark_connection
## $ Ozone   &amp;lt;int&amp;gt; 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, 18, 1...
## $ Solar_R &amp;lt;int&amp;gt; 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256, 290,...
## $ Wind    &amp;lt;dbl&amp;gt; 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9...
## $ Temp    &amp;lt;int&amp;gt; 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58,...
## $ Month   &amp;lt;int&amp;gt; 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,...
## $ Day     &amp;lt;int&amp;gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the data is stored outside R in some other format, sparklyr provides functions to import it. For example, to load a csv file we use the function &lt;strong&gt;spark_read_csv&lt;/strong&gt;, and for json we use &lt;strong&gt;spark_read_json&lt;/strong&gt;. To get the list of all the sparklyr functions and their usage, click 
&lt;a href=&#34;https://cran.r-project.org/web/packages/sparklyr/sparklyr.pdf&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For illustration we will load the &lt;strong&gt;creditcard&lt;/strong&gt; data stored on my machine as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;card&amp;lt;-spark_read_csv(sc,&amp;quot;creditcard.csv&amp;quot;)
sdf_dim(card)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 284807     31&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, using the same connection &lt;strong&gt;sc&lt;/strong&gt; we have loaded two data sets, &lt;strong&gt;mydata&lt;/strong&gt; and &lt;strong&gt;card&lt;/strong&gt;.&lt;/p&gt;
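&lt;p&gt;As a quick optional check (a sketch, assuming the connection &lt;strong&gt;sc&lt;/strong&gt; created above), we can list the tables currently registered in the spark connection with the dplyr function &lt;strong&gt;src_tbls&lt;/strong&gt;; it should report both &lt;strong&gt;airquality&lt;/strong&gt; and &lt;strong&gt;creditcard&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# list the tables cached in the spark connection sc
#src_tbls(sc)&lt;/code&gt;&lt;/pre&gt;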
&lt;p&gt;If we want to see what is going on in spark, we call the function &lt;strong&gt;spark_web()&lt;/strong&gt;, which takes us to the spark web interface.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#spark_web(sc)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;manipulating-data&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Manipulating data&lt;/h1&gt;
&lt;p&gt;With the help of sparklyr, we can very easily access the data in spark memory using dplyr functions. Let’s apply some manipulations to the data &lt;strong&gt;card&lt;/strong&gt;: for instance, filtering the data using the variable &lt;strong&gt;Time&lt;/strong&gt;, then computing the mean of &lt;strong&gt;Amount&lt;/strong&gt; for each class label of the variable &lt;strong&gt;Class&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;card %&amp;gt;%
  filter(Time &amp;lt;= mean(Time,na.rm = TRUE))%&amp;gt;%
      group_by(Class)%&amp;gt;%
  summarise(Class_avg=mean(Amount,na.rm=TRUE))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 2]
##   Class Class_avg
##   &amp;lt;int&amp;gt;     &amp;lt;dbl&amp;gt;
## 1     0      89.0
## 2     1     117.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, the output is a very small table, which can be moved from spark memory into R memory for further analysis using the function &lt;strong&gt;collect&lt;/strong&gt;. In other words, if you feel more at ease in R, then for any spark output that is small enough to be processed in R, add this function at the end of your pipeline before running it, to bring the output into R. For example, we cannot use the function &lt;strong&gt;plot&lt;/strong&gt; directly on the above table; we must first pull the output into R and then apply the function &lt;strong&gt;plot&lt;/strong&gt; as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;card %&amp;gt;%
  filter(Time &amp;lt;= mean(Time,na.rm = TRUE))%&amp;gt;%
      group_by(Class)%&amp;gt;%
  summarise(Class_avg=mean(Amount,na.rm=TRUE))%&amp;gt;%
  collect()%&amp;gt;%
  plot(col=&amp;quot;red&amp;quot;,pch=19,main = &amp;quot;Class average vs Class&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/sparklyr/sparklyr_files/figure-html/unnamed-chunk-10-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;However, we can plot sparklyr outputs without having to move them into R memory by using the &lt;strong&gt;dbplot&lt;/strong&gt; functions, since most of the functions of this package are supported by sparklyr. Let’s, for example, plot the mean of Amount by Class for the card transactions whose time is less than the mean.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dbplot)
card %&amp;gt;%
  filter(Time &amp;lt;= mean(Time,na.rm = TRUE))%&amp;gt;%
        dbplot_bar(Class,mean(Amount))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/sparklyr/sparklyr_files/figure-html/unnamed-chunk-11-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As we can see, the mean Amount of fraudulent cards is higher than that of regular cards.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;disconnecting&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Disconnecting&lt;/h1&gt;
&lt;p&gt;Each time you finish your work, remember to disconnect from spark to free up your resources, as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#spark_disconnect(sc)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;saving-data&#34; class=&#34;section level1&#34; number=&#34;8&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;8&lt;/span&gt; Saving data&lt;/h1&gt;
&lt;p&gt;Sparklyr provides functions to save files directly from spark memory to our working directory. For example, to save data as a csv file we use the function &lt;strong&gt;spark_write_csv&lt;/strong&gt; (we can also save in other formats using functions such as &lt;strong&gt;spark_write_parquet&lt;/strong&gt;) as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#spark_write_csv(card,&amp;quot;card.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;example-of-modeling-in-spark&#34; class=&#34;section level1&#34; number=&#34;9&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;9&lt;/span&gt; Example of modeling in spark&lt;/h1&gt;
&lt;p&gt;For machine learning models, spark has its own library, &lt;strong&gt;MLlib&lt;/strong&gt;, which has almost everything we need, so we do not need the &lt;strong&gt;caret&lt;/strong&gt; library.&lt;/p&gt;
&lt;p&gt;To illustrate how to fit a machine learning model, we train a logistic regression model to predict fraudulent cards from the data &lt;strong&gt;card&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;First, let’s split the data into a training set and a testing set; to do this we use the function &lt;strong&gt;sdf_random_split&lt;/strong&gt; as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;partitions&amp;lt;-card%&amp;gt;%
  sdf_random_split(training=0.8,test=0.2,seed = 123)
train&amp;lt;-partitions$training
test&amp;lt;-partitions$test&lt;/code&gt;&lt;/pre&gt;
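&lt;p&gt;As an optional sanity check (a sketch using the partitions defined above), we can count the rows of each partition with the sparklyr function &lt;strong&gt;sdf_nrow&lt;/strong&gt;; the two counts should be roughly 80% and 20% of the total number of rows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# row counts of the training and testing partitions
#sdf_nrow(train)
#sdf_nrow(test)&lt;/code&gt;&lt;/pre&gt;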
&lt;p&gt;Now we will use the &lt;strong&gt;train&lt;/strong&gt; set to train our model, and the &lt;strong&gt;test&lt;/strong&gt; set to assess the model’s performance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_in_spark&amp;lt;-train %&amp;gt;%
  ml_logistic_regression(Class~.)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can get the summary of this model by typing its name.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_in_spark&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Formula: Class ~ .
## 
## Coefficients:
##   (Intercept)          Time            V1            V2            V3 
## -8.305599e+00 -4.074154e-06  1.065118e-01  1.473891e-02 -8.426563e-03 
##            V4            V5            V6            V7            V8 
##  6.996793e-01  1.380980e-01 -1.217416e-01 -1.205822e-01 -1.700146e-01 
##            V9           V10           V11           V12           V13 
## -2.734966e-01 -8.277600e-01 -4.476393e-02  7.416858e-02 -2.828732e-01 
##           V14           V15           V16           V17           V18 
## -5.317753e-01 -1.221061e-01 -2.476344e-01 -1.591295e-03  3.403402e-02 
##           V19           V20           V21           V22           V23 
##  9.213132e-02 -4.914719e-01  3.863870e-01  6.407714e-01 -1.096256e-01 
##           V24           V25           V26           V27           V28 
##  1.366914e-01 -5.108841e-02  9.977837e-02 -8.384655e-01 -3.072630e-01 
##        Amount 
##  1.039041e-03&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fortunately, sparklyr also supports the functions of the &lt;strong&gt;broom&lt;/strong&gt; package, so we can get a nicer table using the function &lt;strong&gt;tidy&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(broom)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;broom&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tidy(model_in_spark)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 31 x 2
##    features    coefficients
##    &amp;lt;chr&amp;gt;              &amp;lt;dbl&amp;gt;
##  1 (Intercept)  -8.31      
##  2 Time         -0.00000407
##  3 V1            0.107     
##  4 V2            0.0147    
##  5 V3           -0.00843   
##  6 V4            0.700     
##  7 V5            0.138     
##  8 V6           -0.122     
##  9 V7           -0.121     
## 10 V8           -0.170     
## # ... with 21 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To evaluate the model performance, we use the function &lt;strong&gt;ml_evaluate&lt;/strong&gt; as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_summary&amp;lt;-ml_evaluate(model_in_spark,train)
model_summary&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## BinaryLogisticRegressionSummaryImpl 
##  Access the following via `$` or `ml_summary()`. 
##  - features_col() 
##  - label_col() 
##  - predictions() 
##  - probability_col() 
##  - area_under_roc() 
##  - f_measure_by_threshold() 
##  - pr() 
##  - precision_by_threshold() 
##  - recall_by_threshold() 
##  - roc() 
##  - prediction_col() 
##  - accuracy() 
##  - f_measure_by_label() 
##  - false_positive_rate_by_label() 
##  - labels() 
##  - precision_by_label() 
##  - recall_by_label() 
##  - true_positive_rate_by_label() 
##  - weighted_f_measure() 
##  - weighted_false_positive_rate() 
##  - weighted_precision() 
##  - weighted_recall() 
##  - weighted_true_positive_rate()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To extract a particular metric, we use &lt;strong&gt;$&lt;/strong&gt;. We can extract, for example, &lt;strong&gt;the accuracy rate&lt;/strong&gt;, the &lt;strong&gt;AUC&lt;/strong&gt;, or the &lt;strong&gt;roc&lt;/strong&gt; table.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_summary$area_under_roc()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.9765604&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_summary$accuracy()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.999149&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_summary$roc()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 2]
##        FPR   TPR
##      &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
##  1 0       0    
##  2 0.00849 0.876
##  3 0.0185  0.898
##  4 0.0285  0.908
##  5 0.0386  0.917
##  6 0.0487  0.922
##  7 0.0587  0.922
##  8 0.0688  0.925
##  9 0.0788  0.929
## 10 0.0888  0.934
## # ... with more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can retrieve this table into R to plot it with ggplot2, using the function &lt;strong&gt;collect&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_summary$roc()%&amp;gt;%
collect()%&amp;gt;%
ggplot(aes(FPR,TPR ))+
  geom_line(col=&amp;quot;blue&amp;quot;)+
  geom_abline(intercept = 0,slope = 1,col=&amp;quot;red&amp;quot;)+
  ggtitle(&amp;quot;the roc of model_in_spark &amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/sparklyr/sparklyr_files/figure-html/unnamed-chunk-20-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;A high accuracy rate on the training set may simply be the result of overfitting; the accuracy rate on the testing set is the more reliable one.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-ml_evaluate(model_in_spark,test)
pred$accuracy()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.9994722&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred$area_under_roc()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.9692241&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, to get the predictions we use the function &lt;strong&gt;ml_predict&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-ml_predict(model_in_spark,test)%&amp;gt;%
select(.,Class,prediction,probability_0,probability_1)
pred  &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 4]
##    Class prediction probability_0 probability_1
##    &amp;lt;int&amp;gt;      &amp;lt;dbl&amp;gt;         &amp;lt;dbl&amp;gt;         &amp;lt;dbl&amp;gt;
##  1     0          0         1.00       0.000221
##  2     0          0         1.00       0.000441
##  3     0          0         1.00       0.000184
##  4     0          0         1.00       0.000490
##  5     0          0         1.00       0.000199
##  6     0          0         0.999      0.000708
##  7     0          0         1.00       0.000231
##  8     0          0         0.999      0.000640
##  9     0          0         1.00       0.000265
## 10     0          0         0.999      0.000720
## # ... with more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, too, we can use the function &lt;strong&gt;collect&lt;/strong&gt; to plot the results.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred%&amp;gt;%
  collect()%&amp;gt;%
  ggplot(aes(Class,prediction ))+
  geom_point(size=0.1)+
  geom_jitter()+
  ggtitle(&amp;quot;Actual vs predicted&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://modelingwithr.rbind.io/sparklyr/sparklyr_files/figure-html/unnamed-chunk-23-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;streaming&#34; class=&#34;section level1&#34; number=&#34;10&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;10&lt;/span&gt; Streaming&lt;/h1&gt;
&lt;p&gt;Among the most powerful properties of spark is that it can handle streaming data very easily. To show this, let’s use a simple example: we create a folder to contain the input of some data transformations and save the output in another folder, so that each time we add files to the first folder, the transformations are executed automatically and the output is saved in the second folder.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#dir.create(&amp;quot;raw_data&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once the folder is created, we split the data &lt;strong&gt;card&lt;/strong&gt; into two parts: the first part is exported now to the folder &lt;strong&gt;raw_data&lt;/strong&gt;, and then we apply some operations using the spark functions &lt;strong&gt;stream_read_csv&lt;/strong&gt; and &lt;strong&gt;stream_write_csv&lt;/strong&gt; as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#card1&amp;lt;-card%&amp;gt;%
  #filter(Time&amp;lt;=mean(Time,na.rm = TRUE))
#write.csv(card1,&amp;quot;raw_data/card1.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#stream &amp;lt;- stream_read_csv(sc,&amp;quot;raw_data/&amp;quot;)%&amp;gt;%
 # select(Class,Amount) %&amp;gt;%
#  stream_write_csv(&amp;quot;result/&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we add the second part to the folder raw_data, the streaming process launches to execute the above operations.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#card2&amp;lt;-card%&amp;gt;%
 # filter(Time&amp;gt;mean(Time,na.rm = TRUE))
#write.csv(card,&amp;quot;raw_data/card2.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#dir(&amp;quot;result&amp;quot;,pattern = &amp;quot;.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We stop the stream.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#stream_stop(stream)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sdf_describe(card)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # Source: spark&amp;lt;?&amp;gt; [?? x 32]
##   summary Time  V1    V2    V3    V4    V5    V6    V7    V8    V9    V10  
##   &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;
## 1 count   2848~ 2848~ 2848~ 2848~ 2848~ 2848~ 2848~ 2848~ 2848~ 2848~ 2848~
## 2 mean    9481~ 1.75~ -8.2~ -9.6~ 8.32~ 1.64~ 4.24~ -3.0~ 8.81~ -1.1~ 7.09~
## 3 stddev  4748~ 1.95~ 1.65~ 1.51~ 1.41~ 1.38~ 1.33~ 1.23~ 1.19~ 1.09~ 1.08~
## 4 min     0     -56.~ -72.~ -48.~ -5.6~ -113~ -26.~ -43.~ -73.~ -13.~ -24.~
## 5 max     1727~ 2.45~ 22.0~ 9.38~ 16.8~ 34.8~ 73.3~ 120.~ 20.0~ 15.5~ 23.7~
## # ... with 20 more variables: V11 &amp;lt;chr&amp;gt;, V12 &amp;lt;chr&amp;gt;, V13 &amp;lt;chr&amp;gt;, V14 &amp;lt;chr&amp;gt;,
## #   V15 &amp;lt;chr&amp;gt;, V16 &amp;lt;chr&amp;gt;, V17 &amp;lt;chr&amp;gt;, V18 &amp;lt;chr&amp;gt;, V19 &amp;lt;chr&amp;gt;, V20 &amp;lt;chr&amp;gt;,
## #   V21 &amp;lt;chr&amp;gt;, V22 &amp;lt;chr&amp;gt;, V23 &amp;lt;chr&amp;gt;, V24 &amp;lt;chr&amp;gt;, V25 &amp;lt;chr&amp;gt;, V26 &amp;lt;chr&amp;gt;,
## #   V27 &amp;lt;chr&amp;gt;, V28 &amp;lt;chr&amp;gt;, Amount &amp;lt;chr&amp;gt;, Class &amp;lt;chr&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we disconnect&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;spark_disconnect(sc)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>An example journal article</title>
      <link>https://modelingwithr.rbind.io/publication/journal-article/</link>
      <pubDate>Tue, 01 Sep 2015 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/publication/journal-article/</guid>
      <description>&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click the &lt;em&gt;Cite&lt;/em&gt; button above to demo the feature to enable visitors to import publication metadata into their reference management software.
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click the &lt;em&gt;Slides&lt;/em&gt; button above to demo Academic&amp;rsquo;s Markdown slides feature.
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Supplementary notes can be added here, including 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/writing-markdown-latex/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;code and math&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>An example conference paper</title>
      <link>https://modelingwithr.rbind.io/publication/conference-paper/</link>
      <pubDate>Mon, 01 Jul 2013 00:00:00 +0000</pubDate>
      <guid>https://modelingwithr.rbind.io/publication/conference-paper/</guid>
      <description>&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click the &lt;em&gt;Cite&lt;/em&gt; button above to demo the feature to enable visitors to import publication metadata into their reference management software.
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click the &lt;em&gt;Slides&lt;/em&gt; button above to demo Academic&amp;rsquo;s Markdown slides feature.
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Supplementary notes can be added here, including 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/writing-markdown-latex/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;code and math&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
