
To do so, I selected and extracted features from the raw data, including age, days between onset and outcome, gender, and whether the patients were hospitalised. Missing values were imputed and different model algorithms were used to predict the outcome (death or recovery), comparing their prediction accuracy, sensitivity and specificity. The prepared dataset was divided into training and test subsets; the test subset contained all cases with an unknown outcome. Before I applied the models to the test data, I further split the training data into validation subsets.

The tested modeling algorithms were similarly successful at predicting the outcomes of the validation data. To decide on final classifications, I compared predictions from all models and defined the outcome “Death” or “Recovery” as a function of all models, whereas classifications with a low prediction probability were flagged as “uncertain”. Accounting for this uncertainty led to a 100% correct classification of the validation test set.

The cases with unknown outcome were then classified using the same algorithms. Of the 57 unknown cases, 14 were classified as “Recovery”, 10 as “Death” and 33 as uncertain.

The dataset contains case ID, date of onset, date of hospitalisation, date of outcome, gender, age, province and of course the outcome: Death or Recovery. I can already see that there are a couple of missing values in the data, which I will deal with later.

```r
# install and load package
if (!require("outbreaks")) install.packages("outbreaks")
library(outbreaks)

# back up original dataset in case something goes awry along the way
fluH7N9.china.2013_backup <- fluH7N9.china.2013

# convert ? to NAs
fluH7N9.china.2013$age[which(fluH7N9.china.2013$age == "?")] <- NA

# create a new column with case ID
fluH7N9.china.2013$case.ID <- paste("case", fluH7N9.china.2013$case.ID, sep = "_")
head(fluH7N9.china.2013)
```

```
##   case.ID date.of.onset date.of.hospitalisation date.of.outcome outcome gender age province
## 1  case_1    2013-02-19                    <NA>      2013-03-04   Death      m  87 Shanghai
## 2  case_2    2013-02-27              2013-03-03      2013-03-10   Death      m  27 Shanghai
## 3  case_3    2013-03-09              2013-03-19      2013-04-09   Death      f  35    Anhui
## 4  case_4    2013-03-19              2013-03-27            <NA>    <NA>      f  45  Jiangsu
## 5  case_5    2013-03-19              2013-03-30      2013-05-15 Recover      f  48  Jiangsu
## 6  case_6    2013-03-21              2013-03-28      2013-04-26   Death      f  32  Jiangsu
```

Before I start preparing the data for Machine Learning, I want to get an idea of the distribution of the data points and their different variables by plotting them. Most provinces have only a handful of cases, so I am combining them into the category “Other” and keeping only Jiangsu, Shanghai and Zhejiang as separate provinces.

```r
# gather for plotting with ggplot2
library(tidyr)
fluH7N9.china.2013_gather <- fluH7N9.china.2013 %>%
  gather(Group, Date, date.of.onset:date.of.outcome)

# rearrange group order
fluH7N9.china.2013_gather$Group <- factor(fluH7N9.china.2013_gather$Group,
    levels = c("date.of.onset", "date.of.hospitalisation", "date.of.outcome"))

# rename groups
library(plyr)
fluH7N9.china.2013_gather$Group <- mapvalues(fluH7N9.china.2013_gather$Group,
    from = c("date.of.onset", "date.of.hospitalisation", "date.of.outcome"),
    to = c("Date of onset", "Date of hospitalisation", "Date of outcome"))

# renaming provinces
fluH7N9.china.2013_gather$province <- mapvalues(fluH7N9.china.2013_gather$province,
    from = c("Anhui", "Beijing", "Fujian", "Guangdong", "Hebei", "Henan", "Hunan",
             "Jiangxi", "Shandong", "Taiwan"),
    to = rep("Other", 10))

# add a level for unknown gender
levels(fluH7N9.china.2013_gather$gender) <- c(levels(fluH7N9.china.2013_gather$gender), "unknown")
fluH7N9.china.2013_gather$gender[is.na(fluH7N9.china.2013_gather$gender)] <- "unknown"

# rearrange province order so that Other is the last
fluH7N9.china.2013_gather$province <- factor(fluH7N9.china.2013_gather$province,
    levels = c("Jiangsu", "Shanghai", "Zhejiang", "Other"))

# convert age to numeric
fluH7N9.china.2013_gather$age <- as.numeric(as.character(fluH7N9.china.2013_gather$age))

library(ggplot2)
my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 12),
      axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 14),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "aliceblue"),
      strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1),
      strip.text = element_text(face = "bold", size = 12, color = "black"),
      legend.position = "bottom",
      legend.justification = "top",
      legend.box = "horizontal",
      legend.box.background = element_rect(colour = "grey50"),
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
    )
}

# plotting raw data
ggplot(data = fluH7N9.china.2013_gather, aes(x = Date, y = age, fill = outcome)) +
  stat_density2d(aes(alpha = ..level..), geom = "polygon") +
  geom_jitter(aes(color = outcome, shape = gender), size = 1.5) +
  geom_rug(aes(color = outcome)) +
  labs(
    fill = "Outcome", color = "Outcome", alpha = "Level", shape = "Gender",
    x = "Date in 2013", y = "Age",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Dataset from 'outbreaks' package (Kucharski et al. 2014)",
    caption = ""
  ) +
  facet_grid(Group ~ province) +
  my_theme() +
  scale_shape_manual(values = c(15, 16, 17)) +
  scale_color_brewer(palette = "Set1", na.value = "grey50") +
  scale_fill_brewer(palette = "Set1")
```

This plot shows the dates of onset, hospitalisation and outcome (if known) of each data point. Outcome is marked by color and age is shown on the y-axis. Gender is marked by point shape. The density distribution of date by age seems to indicate that older people died more frequently in the Jiangsu and Zhejiang provinces than in Shanghai and in other provinces. The distribution of points along the time axis suggests that there might be a positive correlation between the likelihood of death and an early onset or early outcome.

I also want to know how many cases there are for each gender and province and compare the genders’ age distribution.

```r
fluH7N9.china.2013_gather_2 <- fluH7N9.china.2013_gather[, -4] %>%
  gather(group_2, value, gender:province)

fluH7N9.china.2013_gather_2$value <- mapvalues(fluH7N9.china.2013_gather_2$value,
    from = c("m", "f", "unknown", "Other"),
    to = c("Male", "Female", "Unknown gender", "Other province"))

fluH7N9.china.2013_gather_2$value <- factor(fluH7N9.china.2013_gather_2$value,
    levels = c("Female", "Male", "Unknown gender", "Jiangsu", "Shanghai", "Zhejiang", "Other province"))

p1 <- ggplot(data = fluH7N9.china.2013_gather_2, aes(x = value, fill = outcome, color = outcome)) +
  geom_bar(position = "dodge", alpha = 0.7, size = 1) +
  my_theme() +
  scale_fill_brewer(palette = "Set1", na.value = "grey50") +
  scale_color_brewer(palette = "Set1", na.value = "grey50") +
  labs(
    color = "", fill = "", x = "", y = "Count",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Gender and Province numbers of flu cases",
    caption = ""
  )

p2 <- ggplot(data = fluH7N9.china.2013_gather, aes(x = age, fill = outcome, color = outcome)) +
  geom_density(alpha = 0.3, size = 1) +
  geom_rug() +
  scale_color_brewer(palette = "Set1", na.value = "grey50") +
  scale_fill_brewer(palette = "Set1", na.value = "grey50") +
  my_theme() +
  labs(
    color = "", fill = "", x = "Age", y = "Density",
    title = "", subtitle = "Age distribution of flu cases", caption = ""
  )

library(gridExtra)
library(grid)
grid.arrange(p1, p2, ncol = 2)
```

In the dataset, there are more male than female cases and correspondingly, we see more deaths, recoveries and unknown outcomes in men than in women. This is potentially a problem later on for modeling because the inherent likelihoods for outcome are not directly comparable between the sexes. Most unknown outcomes were recorded in Zhejiang. Similarly to gender, we don’t have an equal distribution of data points across provinces either. When we look at the age distribution it is obvious that people who died tended to be slightly older than those who recovered. The density curve of unknown outcomes is more similar to that of death than of recovery, suggesting that among these people there might have been more deaths than recoveries. And lastly, I want to plot how many days passed between onset, hospitalisation and outcome for each case.

```r
ggplot(data = fluH7N9.china.2013_gather, aes(x = Date, y = age, color = outcome)) +
  geom_point(aes(shape = gender), size = 1.5, alpha = 0.6) +
  geom_path(aes(group = case.ID)) +
  facet_wrap( ~ province, ncol = 2) +
  my_theme() +
  scale_shape_manual(values = c(15, 16, 17)) +
  scale_color_brewer(palette = "Set1", na.value = "grey50") +
  scale_fill_brewer(palette = "Set1") +
  labs(
    color = "Outcome", shape = "Gender",
    x = "Date in 2013", y = "Age",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Dataset from 'outbreaks' package (Kucharski et al. 2014)",
    caption = "\nTime from onset of flu to outcome."
  )
```

This plot shows that there are many missing values in the dates, so it is hard to draw a general conclusion.

In Machine Learning-speak features are the variables used for model training. Using the right features dramatically influences the accuracy of the model.

Because we don’t have many features, I am keeping age as it is, but I am also generating new features:

- from the date information I am calculating the days between onset and outcome and between onset and hospitalisation
- I am converting gender into numeric values with 1 for female and 0 for male
- similarly, I am converting provinces to binary classifiers (yes == 1, no == 0) for Shanghai, Zhejiang, Jiangsu and other provinces
- the same binary classification is given for whether a case was hospitalised, and whether they had an early onset or early outcome (earlier than the median date)
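As a minimal sketch of this encoding, here is the same logic applied to a small hypothetical data frame (two made-up cases, not the real dataset):

```r
# hypothetical two-case data frame illustrating the feature encoding above
toy <- data.frame(
  gender = c("f", "m"),
  province = c("Shanghai", "Anhui"),
  date.of.onset = as.Date(c("2013-03-01", "2013-04-10")),
  date.of.outcome = as.Date(c("2013-03-20", "2013-04-25"))
)

# gender as 1 (female) / 0 (male)
toy$gender_f <- ifelse(toy$gender == "f", 1, 0)

# binary classifier for one province
toy$province_Shanghai <- ifelse(toy$province == "Shanghai", 1, 0)

# days between onset and outcome from the date difference
toy$days_onset_to_outcome <- as.numeric(toy$date.of.outcome - toy$date.of.onset)

# early onset: onset earlier than the median onset date
toy$early_onset <- ifelse(toy$date.of.onset < median(toy$date.of.onset), 1, 0)
```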

```r
# preparing the data frame for modeling
library(dplyr)

dataset <- fluH7N9.china.2013 %>%
  mutate(
    hospital = as.factor(ifelse(is.na(date.of.hospitalisation), 0, 1)),
    gender_f = as.factor(ifelse(gender == "f", 1, 0)),
    province_Jiangsu = as.factor(ifelse(province == "Jiangsu", 1, 0)),
    province_Shanghai = as.factor(ifelse(province == "Shanghai", 1, 0)),
    province_Zhejiang = as.factor(ifelse(province == "Zhejiang", 1, 0)),
    province_other = as.factor(ifelse(province == "Zhejiang" | province == "Jiangsu" | province == "Shanghai", 0, 1)),
    days_onset_to_outcome = as.numeric(as.character(gsub(" days", "",
        as.Date(as.character(date.of.outcome), format = "%Y-%m-%d") -
        as.Date(as.character(date.of.onset), format = "%Y-%m-%d")))),
    days_onset_to_hospital = as.numeric(as.character(gsub(" days", "",
        as.Date(as.character(date.of.hospitalisation), format = "%Y-%m-%d") -
        as.Date(as.character(date.of.onset), format = "%Y-%m-%d")))),
    age = as.numeric(as.character(age)),
    early_onset = as.factor(ifelse(date.of.onset < summary(fluH7N9.china.2013$date.of.onset)[[3]], 1, 0)),
    early_outcome = as.factor(ifelse(date.of.outcome < summary(fluH7N9.china.2013$date.of.outcome)[[3]], 1, 0))
  ) %>%
  subset(select = -c(2:4, 6, 8))

rownames(dataset) <- dataset$case.ID
dataset <- dataset[, -1]
```

When looking at the dataset I created for modeling, it is obvious that we have quite a few missing values. The missing values from the outcome column are what I want to predict but for the rest I would either have to remove the entire row from the data or impute the missing information. I decided to try the latter with the mice package.

```r
# impute missing data
library(mice)
dataset_impute <- mice(dataset[, -1], print = FALSE)

# recombine imputed data frame with the outcome column
dataset_complete <- merge(dataset[, 1, drop = FALSE], mice::complete(dataset_impute, 1),
                          by = "row.names", all = TRUE)
rownames(dataset_complete) <- dataset_complete$Row.names
dataset_complete <- dataset_complete[, -1]
```

For building the model, I am separating the imputed data frame into training and test data. Test data are the 57 cases with unknown outcome.

```r
summary(dataset$outcome)
```

```
##   Death Recover    NA's 
##      32      47      57
```

The training data will be further divided for validation of the models: 70% of the training data will be kept for model building and the remaining 30% will be used for model testing. I am using the caret package for modeling.

```r
train_index <- which(is.na(dataset_complete$outcome))
train_data <- dataset_complete[-train_index, ]
test_data <- dataset_complete[train_index, -1]

library(caret)
set.seed(27)
val_index <- createDataPartition(train_data$outcome, p = 0.7, list = FALSE)
val_train_data <- train_data[val_index, ]
val_test_data <- train_data[-val_index, ]
val_train_X <- val_train_data[, -1]
val_test_X <- val_test_data[, -1]
```

To get an idea about how each feature contributes to the prediction of the outcome, I first built a decision tree based on the training data using rpart and rattle.

```r
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)

set.seed(27)
fit <- rpart(outcome ~ .,
             data = train_data,
             method = "class",
             control = rpart.control(xval = 10, minbucket = 2, cp = 0),
             parms = list(split = "information"))
fancyRpartPlot(fit)
```

This decision tree shows that cases with an early outcome were most likely to die when they were 68 or older, when they also had an early onset and when they were sick for fewer than 13 days. If a person was not among the first cases and was younger than 52, they had a good chance of recovering, but if they were 82 or older, they were more likely to die from the flu.

Not all of the features I created will be equally important to the model. The decision tree already gave me an idea of which features might be most important but I also want to estimate feature importance using a Random Forest approach with repeated cross validation.

```r
# prepare training scheme
control <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

# train the model
set.seed(27)
model <- train(outcome ~ ., data = train_data, method = "rf", preProcess = NULL, trControl = control)

# estimate variable importance
importance <- varImp(model, scale = TRUE)

# prepare for plotting
importance_df_1 <- importance$importance
importance_df_1$group <- rownames(importance_df_1)
importance_df_1$group <- mapvalues(importance_df_1$group,
    from = c("age", "hospital2", "gender_f2", "province_Jiangsu2", "province_Shanghai2",
             "province_Zhejiang2", "province_other2", "days_onset_to_outcome",
             "days_onset_to_hospital", "early_onset2", "early_outcome2"),
    to = c("Age", "Hospital", "Female", "Jiangsu", "Shanghai", "Zhejiang", "Other province",
           "Days onset to outcome", "Days onset to hospital", "Early onset", "Early outcome"))

f <- importance_df_1[order(importance_df_1$Overall, decreasing = FALSE), "group"]

importance_df_2 <- importance_df_1
importance_df_2$Overall <- 0
importance_df <- rbind(importance_df_1, importance_df_2)

# setting factor levels
importance_df <- within(importance_df, group <- factor(group, levels = f))
importance_df_1 <- within(importance_df_1, group <- factor(group, levels = f))

ggplot() +
  geom_point(data = importance_df_1, aes(x = Overall, y = group, color = group), size = 2) +
  geom_path(data = importance_df, aes(x = Overall, y = group, color = group, group = group), size = 1) +
  scale_color_manual(values = rep(brewer.pal(1, "Set1")[1], 11)) +
  my_theme() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5)) +
  labs(
    x = "Importance", y = "",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Scaled feature importance",
    caption = "\nDetermined with Random Forest and repeated cross validation (10 repeats, 10 times)"
  )
```

Before I start actually building models, I want to check whether the distribution of feature values is comparable between training, validation and test datasets.

```r
# prepare for plotting
dataset_complete_gather <- dataset_complete %>%
  mutate(set = ifelse(rownames(dataset_complete) %in% rownames(test_data), "Test Data",
               ifelse(rownames(dataset_complete) %in% rownames(val_train_data), "Validation Train Data",
               ifelse(rownames(dataset_complete) %in% rownames(val_test_data), "Validation Test Data", "NA"))),
         case_ID = rownames(.)) %>%
  gather(group, value, age:early_outcome)

dataset_complete_gather$group <- mapvalues(dataset_complete_gather$group,
    from = c("age", "hospital", "gender_f", "province_Jiangsu", "province_Shanghai",
             "province_Zhejiang", "province_other", "days_onset_to_outcome",
             "days_onset_to_hospital", "early_onset", "early_outcome"),
    to = c("Age", "Hospital", "Female", "Jiangsu", "Shanghai", "Zhejiang", "Other province",
           "Days onset to outcome", "Days onset to hospital", "Early onset", "Early outcome"))

ggplot(data = dataset_complete_gather, aes(x = as.numeric(value), fill = outcome, color = outcome)) +
  geom_density(alpha = 0.2) +
  geom_rug() +
  scale_color_brewer(palette = "Set1", na.value = "grey50") +
  scale_fill_brewer(palette = "Set1", na.value = "grey50") +
  my_theme() +
  facet_wrap(set ~ group, ncol = 11, scales = "free") +
  labs(
    x = "Value", y = "Density",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Features for classifying outcome",
    caption = "\nDensity distribution of all features used for classification of flu outcome."
  )
```

```r
ggplot(subset(dataset_complete_gather,
              group == "Age" | group == "Days onset to hospital" | group == "Days onset to outcome"),
       aes(x = outcome, y = as.numeric(value), fill = set)) +
  geom_boxplot() +
  my_theme() +
  scale_fill_brewer(palette = "Set1", type = "div") +
  facet_wrap( ~ group, ncol = 3, scales = "free") +
  labs(
    fill = "", x = "Outcome", y = "Value",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Features for classifying outcome",
    caption = "\nBoxplot of the features age, days from onset to hospitalisation and days from onset to outcome."
  )
```

Luckily, the distributions look reasonably similar between the validation and test data, except for a few outliers.

Before I try to predict the outcome of the unknown cases, I am testing the models’ accuracy on the validation datasets with a couple of algorithms. I have chosen only a few well-known algorithms, but caret implements many more. I chose not to do any preprocessing because I was worried that the different data distributions of continuous variables (e.g. age) and binary variables (i.e. 0/1 classification of e.g. hospitalisation) would lead to problems.

Random Forests predictions are based on the generation of multiple classification trees. This model classified 14 out of 23 cases correctly.

```r
set.seed(27)
model_rf <- caret::train(outcome ~ .,
                         data = val_train_data,
                         method = "rf",
                         preProcess = NULL,
                         trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_rf, val_test_data[, -1]), val_test_data$outcome)
```

Lasso and elastic net regularization for generalized linear models (glmnet) is based on linear regression and is useful when we have correlated features in our model. This model classified 13 out of 23 cases correctly.

```r
set.seed(27)
model_glmnet <- caret::train(outcome ~ .,
                             data = val_train_data,
                             method = "glmnet",
                             preProcess = NULL,
                             trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_glmnet, val_test_data[, -1]), val_test_data$outcome)
```

Weighted k-nearest neighbors classifies each case based on the classes of its most similar cases in the training data.

```r
set.seed(27)
model_kknn <- caret::train(outcome ~ .,
                           data = val_train_data,
                           method = "kknn",
                           preProcess = NULL,
                           trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_kknn, val_test_data[, -1]), val_test_data$outcome)
```

Penalized discriminant analysis is a regularized version of discriminant analysis.

```r
set.seed(27)
model_pda <- caret::train(outcome ~ .,
                          data = val_train_data,
                          method = "pda",
                          preProcess = NULL,
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_pda, val_test_data[, -1]), val_test_data$outcome)
```

Stabilized Linear Discriminant Analysis is designed for high-dimensional data and correlated co-variables. This model classified 15 out of 23 cases correctly.

```r
set.seed(27)
model_slda <- caret::train(outcome ~ .,
                           data = val_train_data,
                           method = "slda",
                           preProcess = NULL,
                           trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_slda, val_test_data[, -1]), val_test_data$outcome)
```

Nearest Shrunken Centroids computes a standardized centroid for each class and shrinks each centroid toward the overall centroid for all classes. This model classified 15 out of 23 cases correctly.

```r
set.seed(27)
model_pam <- caret::train(outcome ~ .,
                          data = val_train_data,
                          method = "pam",
                          preProcess = NULL,
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_pam, val_test_data[, -1]), val_test_data$outcome)
```

C5.0 is another tree-based modeling algorithm. This model classified 15 out of 23 cases correctly.

```r
set.seed(27)
model_C5.0Tree <- caret::train(outcome ~ .,
                               data = val_train_data,
                               method = "C5.0Tree",
                               preProcess = NULL,
                               trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_C5.0Tree, val_test_data[, -1]), val_test_data$outcome)
```

Partial least squares regression combines principal component analysis and multiple regression and is useful for modeling with correlated features. This model classified 15 out of 23 cases correctly.

```r
set.seed(27)
model_pls <- caret::train(outcome ~ .,
                          data = val_train_data,
                          method = "pls",
                          preProcess = NULL,
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_pls, val_test_data[, -1]), val_test_data$outcome)
```

All models were similarly accurate.

```r
# Create a list of models
models <- list(rf = model_rf, glmnet = model_glmnet, kknn = model_kknn, pda = model_pda,
               slda = model_slda, pam = model_pam, C5.0Tree = model_C5.0Tree, pls = model_pls)

# Resample the models
resample_results <- resamples(models)

# Generate a summary
summary(resample_results, metric = c("Kappa", "Accuracy"))
bwplot(resample_results, metric = c("Kappa", "Accuracy"))
```

To compare the predictions from all models, I summed up the prediction probabilities for Death and Recovery from all models and calculated the log2 ratio of the summed probabilities for Recovery to the summed probabilities for Death. All cases with a log2 ratio greater than 1.5 were defined as Recover, cases with a log2 ratio below -1.5 were defined as Death, and the remaining cases were defined as uncertain.
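As a toy illustration of this voting scheme (using made-up prediction probabilities from two hypothetical models, not the real model output):

```r
# hypothetical prediction probabilities from two models for three cases
probs <- data.frame(
  m1.Death = c(0.9, 0.1, 0.5), m1.Recover = c(0.1, 0.9, 0.5),
  m2.Death = c(0.8, 0.2, 0.6), m2.Recover = c(0.2, 0.8, 0.4)
)

# sum probabilities per outcome across models
sum_Death   <- rowSums(probs[, grep("Death", names(probs))])
sum_Recover <- rowSums(probs[, grep("Recover", names(probs))])

# log2 ratio of Recovery to Death; thresholds at +/- 1.5
log2_ratio <- log2(sum_Recover / sum_Death)
pred <- ifelse(log2_ratio > 1.5, "Recover",
        ifelse(log2_ratio < -1.5, "Death", "uncertain"))
```

Case 3, where the two models disagree, ends up flagged as uncertain; the clear-cut cases 1 and 2 get a definite label.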

```r
results <- data.frame(
  randomForest = predict(model_rf, newdata = val_test_data[, -1], type = "prob"),
  glmnet = predict(model_glmnet, newdata = val_test_data[, -1], type = "prob"),
  kknn = predict(model_kknn, newdata = val_test_data[, -1], type = "prob"),
  pda = predict(model_pda, newdata = val_test_data[, -1], type = "prob"),
  slda = predict(model_slda, newdata = val_test_data[, -1], type = "prob"),
  pam = predict(model_pam, newdata = val_test_data[, -1], type = "prob"),
  C5.0Tree = predict(model_C5.0Tree, newdata = val_test_data[, -1], type = "prob"),
  pls = predict(model_pls, newdata = val_test_data[, -1], type = "prob")
)

results$sum_Death <- rowSums(results[, grep("Death", colnames(results))])
results$sum_Recover <- rowSums(results[, grep("Recover", colnames(results))])
results$log2_ratio <- log2(results$sum_Recover / results$sum_Death)
results$true_outcome <- val_test_data$outcome
results$pred_outcome <- ifelse(results$log2_ratio > 1.5, "Recover",
                        ifelse(results$log2_ratio < -1.5, "Death", "uncertain"))
results$prediction <- ifelse(results$pred_outcome == results$true_outcome, "CORRECT",
                      ifelse(results$pred_outcome == "uncertain", "uncertain", "wrong"))
```

All predictions based on all models were either correct or uncertain.

The above models will now be used to predict the outcome of cases with unknown fate.

```r
set.seed(27)
model_rf <- caret::train(outcome ~ ., data = train_data, method = "rf", preProcess = NULL,
                         trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_glmnet <- caret::train(outcome ~ ., data = train_data, method = "glmnet", preProcess = NULL,
                             trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_kknn <- caret::train(outcome ~ ., data = train_data, method = "kknn", preProcess = NULL,
                           trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_pda <- caret::train(outcome ~ ., data = train_data, method = "pda", preProcess = NULL,
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_slda <- caret::train(outcome ~ ., data = train_data, method = "slda", preProcess = NULL,
                           trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_pam <- caret::train(outcome ~ ., data = train_data, method = "pam", preProcess = NULL,
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_C5.0Tree <- caret::train(outcome ~ ., data = train_data, method = "C5.0Tree", preProcess = NULL,
                               trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_pls <- caret::train(outcome ~ ., data = train_data, method = "pls", preProcess = NULL,
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))

models <- list(rf = model_rf, glmnet = model_glmnet, kknn = model_kknn, pda = model_pda,
               slda = model_slda, pam = model_pam, C5.0Tree = model_C5.0Tree, pls = model_pls)

# Resample the models
resample_results <- resamples(models)
bwplot(resample_results, metric = c("Kappa", "Accuracy"))
```

Here again, the accuracy is similar in all models. The final results are calculated as described above.

```r
results <- data.frame(
  randomForest = predict(model_rf, newdata = test_data, type = "prob"),
  glmnet = predict(model_glmnet, newdata = test_data, type = "prob"),
  kknn = predict(model_kknn, newdata = test_data, type = "prob"),
  pda = predict(model_pda, newdata = test_data, type = "prob"),
  slda = predict(model_slda, newdata = test_data, type = "prob"),
  pam = predict(model_pam, newdata = test_data, type = "prob"),
  C5.0Tree = predict(model_C5.0Tree, newdata = test_data, type = "prob"),
  pls = predict(model_pls, newdata = test_data, type = "prob")
)

results$sum_Death <- rowSums(results[, grep("Death", colnames(results))])
results$sum_Recover <- rowSums(results[, grep("Recover", colnames(results))])
results$log2_ratio <- log2(results$sum_Recover / results$sum_Death)
results$predicted_outcome <- ifelse(results$log2_ratio > 1.5, "Recover",
                             ifelse(results$log2_ratio < -1.5, "Death", "uncertain"))
```

From 57 cases, 14 were defined as Recover, 10 as Death and 33 as uncertain.

```r
results_combined <- merge(results[, -c(1:16)],
                          fluH7N9.china.2013[which(fluH7N9.china.2013$case.ID %in% rownames(results)), ],
                          by.x = "row.names", by.y = "case.ID")
results_combined <- results_combined[, -c(2, 3, 8, 9)]

results_combined_gather <- results_combined %>%
  gather(group_dates, date, date.of.onset:date.of.hospitalisation)

results_combined_gather$group_dates <- factor(results_combined_gather$group_dates,
    levels = c("date.of.onset", "date.of.hospitalisation"))
results_combined_gather$group_dates <- mapvalues(results_combined_gather$group_dates,
    from = c("date.of.onset", "date.of.hospitalisation"),
    to = c("Date of onset", "Date of hospitalisation"))

results_combined_gather$gender <- mapvalues(results_combined_gather$gender,
    from = c("f", "m"), to = c("Female", "Male"))
levels(results_combined_gather$gender) <- c(levels(results_combined_gather$gender), "unknown")
results_combined_gather$gender[is.na(results_combined_gather$gender)] <- "unknown"

results_combined_gather$age <- as.numeric(as.character(results_combined_gather$age))

ggplot(data = results_combined_gather, aes(x = date, y = log2_ratio, color = predicted_outcome)) +
  geom_jitter(aes(size = age), alpha = 0.3) +
  geom_rug() +
  facet_grid(gender ~ group_dates) +
  labs(
    color = "Predicted outcome", size = "Age",
    x = "Date in 2013", y = "log2 ratio of prediction Recover vs Death",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Predicted outcome", caption = ""
  ) +
  my_theme() +
  scale_color_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1")
```

The comparison of date of onset, date of hospitalisation, gender and age with predicted outcome shows that predicted deaths were associated with older age than predicted recoveries. Date of onset does not show an obvious bias in either direction.

This dataset posed a couple of difficulties to begin with, like the unequal distribution of data points across variables and missing data. This makes the modeling inherently prone to flaws. However, real-life data isn’t perfect either, so I went ahead and tested the modeling success anyway. By accounting for uncertain classifications with low prediction probability, the validation data could be classified accurately. Still, these few cases don’t contain enough information to build a truly reliable model of outcome. More cases, more information (i.e. more features) and fewer missing data would improve the modeling outcome. Also, this example is only applicable to this specific flu outbreak. In order to draw more general conclusions about flu outcome, other cases and additional information, for example medical parameters like preexisting medical conditions, disease parameters, demographic information, etc., would be necessary. All in all, this dataset served as a nice example of the possibilities (and pitfalls) of machine learning applications and showcases a basic workflow for building prediction models with R.

If you have questions feel free to post a comment.


The example that I will use throughout this post is the logistic growth function, which is often used in ecology to model population growth. For instance, say you count the number of bacteria cells in a petri dish; in the beginning the cell counts will increase exponentially, but after some time, due to limits in resources (be it space or food), the bacteria population will reach an equilibrium. This produces the classical S-shaped, non-linear, logistic growth curve. The logistic growth function has three parameters: the growth rate called “r”, the population size at equilibrium called “K” and the population size at the beginning called “n0”.
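Before fitting anything, the function itself and its limiting behavior can be sketched directly (this is the same logistic equation used in the model below; the parameter values here are arbitrary):

```r
# logistic growth: population size d(t) with growth rate r,
# carrying capacity K and initial population size n0
logF <- function(time, K, n0, r){
  K * n0 * exp(r * time) / (K + n0 * (exp(r * time) - 1))
}

logF(0, K = 100, n0 = 5, r = 0.2)    # at t = 0 the function returns n0 (5)
logF(500, K = 100, n0 = 5, r = 0.2)  # for large t it approaches K (100)
```

These two limits are what make r, K and n0 interpretable: n0 anchors the curve at the start, K is the plateau, and r controls how fast the S-shape rises between them.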

Say that we want to test the effect of food quality on the population dynamics of our bacteria. To do so we use three different types of food and we count the number of bacteria over time. Here is how to code this in R using the function `nlsList` from the package nlme:

```r
# load libraries
library(nlme)

# first try: effect of treatment on logistic growth
Ks <- c(100, 200, 150)
n0 <- c(5, 5, 6)
r <- c(0.15, 0.2, 0.15)
time <- 1:50

# this function returns population dynamics following a logistic curve
logF <- function(time, K, n0, r){
  d <- K * n0 * exp(r * time) / (K + n0 * (exp(r * time) - 1))
  return(d)
}

# simulate some data
dat <- data.frame(Treatment = character(), Time = numeric(), Abundance = numeric())
for(i in 1:3){
  Ab <- logF(time = time, K = Ks[i], n0 = n0[i], r = r[i])
  # note that random deviates were added to the simulated population density values
  tmp <- data.frame(Treatment = paste0("T", i), Time = time, Abundance = Ab + rnorm(time, 0, 5))
  dat <- rbind(dat, tmp)
}

# the formula for the models
lF <- formula(Abundance ~ K * n0 * exp(r * Time) / (K + n0 * (exp(r * Time) - 1)) | Treatment)

# fit the model
(m <- nlsList(lF, data = dat, start = list(K = 150, n0 = 10, r = 0.5)))
```

```
## Call:
##   Model: Abundance ~ K * n0 * exp(r * Time)/(K + n0 * (exp(r * Time) - 1)) | Treatment
##    Data: dat
##
## Coefficients:
##           K       n0         r
## T1 105.2532 4.996502 0.1470740
## T2 199.5149 4.841359 0.2006150
## T3 150.2882 5.065196 0.1595293
##
## Degrees of freedom: 150 total; 141 residual
## Residual standard error: 5.085229
```

The code simulated population values using three sets of parameters (the r, K and n0’s). Then we specified the non-linear regression formula, using the pipe “|” symbol to explicitly ask for fitting different parameters to each Treatment. The model output gives us the estimated parameters for each Treatment. We can now plot the model predictions against the real data:

```r
library(ggplot2)  # needed for the plot below

# derive the predicted lines
dat$Pred <- predict(m)
# plot
ggplot(dat, aes(x = Time, y = Abundance, color = Treatment)) +
  geom_point() + geom_line(aes(y = Pred))
```

We can also model the effect of continuous predictors on the parameters of the non-linear regression. For instance, we might assume that bacterial colonies grow faster at warmer temperatures. In this case we could model the effect of temperature on the growth rate (assuming that it is linear) and use the fitted growth rate to model the number of bacteria, all within one model. Pretty neat.

To do so I will use another package, bbmle, and its function `mle2`:

```r
# load libraries
library(bbmle)
library(reshape2)

# parameters for simulation
K <- 200
n0 <- 5
# the gradient in temperature
Temp <- ceiling(seq(0, 20, length = 10))

# simulate some data: sample from a Poisson distribution with lambda
# parameter equal to the expected value; note that we do not provide one
# value for the parameter r but a vector of values representing the
# effect of temperature on r
mm <- sapply(Temp, function(x){
  rpois(50, logF(time = 1:50, K = K, n0 = n0, r = 0.05 + 0.01 * x))
})

# some reshaping
datT <- melt(mm)
names(datT) <- c("Time", "Temp", "Abund")
datT$Temp <- Temp[datT$Temp]

# fit the model
(mll <- mle2(Abund ~ dpois(K * n0 * exp(r * Time) / (K + n0 * (exp(r * Time) - 1))),
             data = datT, parameters = list(r ~ Temp),
             start = list(K = 100, n0 = 10, r = 0.05)))
```

```
Call:
mle2(minuslogl = Abund ~ dpois(K * n0 * exp(r * Time)/(K + n0 *
    (exp(r * Time) - 1))), start = list(K = 100, n0 = 10, r = 0.05),
    data = datT, parameters = list(r ~ Temp))

Coefficients:
            K            n0 r.(Intercept)        r.Temp
 200.13176811    4.60546966    0.05195730    0.01016994

Log-likelihood: -1744.01
```

The first argument for `mle2` is either a function returning the negative log-likelihood or a formula of the form “response ~ some distribution(expected value)”. In our case we fit a Poisson distribution with expected values coming from the logistic equation and its three parameters. The dependency between the growth rate and the temperature is given using the “parameters” argument. The basic model output gives us all the estimated parameters; of interest here are “r.(Intercept)”, which is the growth rate when the temperature is 0, and “r.Temp”, which is the increase in the growth rate for every increase of temperature by 1.

Again we can get the fitted values for plotting:

```r
datT$pred <- predict(mll)
ggplot(datT, aes(x = Time, y = Abund, color = factor(Temp))) +
  geom_point() + geom_line(aes(y = pred))
```

The cool thing with `mle2` is that you can fit any model that you can imagine, as long as you are able to write down the log-likelihood function. We will see this with an extension of the previous model. It is a common assumption in biology that species have some optimum temperature, hence we can expect a bell-shaped relation between population equilibrium and temperature. So we will add to the previous model a quadratic relation between the population equilibrium (“K”) and temperature.

```r
# simulate some data; note that this time both K and r are given as
# vectors to represent their relation with temperature
mm <- sapply(Temp, function(x){
  rpois(50, logF(time = 1:50, K = 100 + 20 * x - x**2, n0 = n0, r = 0.05 + 0.01 * x))
})

# some reshaping
datT <- melt(mm)
names(datT) <- c("Time", "Temp", "Abund")
datT$Temp <- Temp[datT$Temp]

# the negative log-likelihood function
LL <- function(k0, k1, k2, n0, r0, r1){
  K <- k0 + k1 * Temp + k2 * Temp**2
  r <- r0 + r1 * Temp
  lbd <- K * n0 * exp(r * Time) / (K + n0 * (exp(r * Time) - 1))
  -sum(dpois(Abund, lbd, log = TRUE))
}

(mll2 <- mle2(LL, data = datT,
              start = list(k0 = 100, k1 = 1, k2 = -0.1, n0 = 10, r0 = 0.1, r1 = 0.1)))
```

```
Call:
mle2(minuslogl = LL, start = list(k0 = 100, k1 = 1, k2 = -0.1,
    n0 = 10, r0 = 0.1, r1 = 0.1), data = datT)

Coefficients:
          k0           k1           k2           n0           r0           r1
100.47610915  20.09677780  -1.00836974   4.84940342   0.04960558   0.01018838

Log-likelihood: -1700.91
```

This time I defined the negative log-likelihood function to fit the model. This function takes the model parameters as arguments. In this case we have 3 parameters to describe the quadratic relation between K and temperature, 1 parameter for the constant n0 and 2 parameters to describe the linear relation between r and temperature. The important thing to understand is what the last line is doing. Basically the `dpois` call returns the probabilities of getting the observed abundance values (“Abund”) given the expected mean values (“lbd”). The `LL` function then returns a single negative log-likelihood value (the sum), and the job of `mle2` is to minimize this value. To do so, `mle2` tries many different parameter combinations, each time calling the LL function, and when it reaches the minimum it returns the parameter set used. This is Maximum Likelihood Estimation (MLE). As before, the model output gives us the estimated parameters, and we can use these to plot the fitted regressions:

```r
cc <- coef(mll2)
# sorry for the awful coding, to get the model predicted values we need
# to code the relation by hand which is rather clumsy ...
datT$pred <- melt(sapply(Temp, function(x) {
  logF(time = 1:50, K = cc[1] + cc[2] * x + cc[3] * x**2,
       n0 = cc[4], r = cc[5] + cc[6] * x)
}))[, 3]
ggplot(datT, aes(x = Time, y = Abund, color = factor(Temp))) +
  geom_point() + geom_line(aes(y = pred))
```
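The minimisation that `mle2` performs can be illustrated with a minimal, self-contained sketch using base R's `optimize` on a one-parameter Poisson model (toy data, not the simulation above):

```r
set.seed(42)
# toy data: Poisson counts with a known mean of 7
y <- rpois(1000, lambda = 7)

# negative log-likelihood as a function of log(lambda)
nll <- function(log_lambda) {
  -sum(dpois(y, exp(log_lambda), log = TRUE))
}

# minimise the negative log-likelihood -- the same job mle2 does,
# here with a one-dimensional optimiser
fit <- optimize(nll, interval = c(-5, 5))
exp(fit$minimum)  # the MLE, which for a Poisson is the sample mean
```

`mle2` does essentially this, but over several parameters at once and with conveniences such as standard errors and likelihood profiles.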

I showed three ways by which you can include the effect of predictors on the parameters of a non-linear regression: nlsList, mle2 with a formula, and mle2 with a customized negative log-likelihood function. The first option is super easy and might be enough in many cases, while the last option, though more complex, gives you (almost) infinite freedom in the models that you can fit.

Happy non-linear modelling!


```r
all.equal(weather_data$Rainfall > 1, weather_data$RainToday == "Yes")
```

As a consequence, so far we are able to predict whether tomorrow's rainfall will be above 1 mm or not. Some more questions to be answered:

- In case of “at least moderate” Rainfall (arbitrarily defined as Rainfall > 15mm), how do our models perform?

In case of “at least moderate” rainfall, we would like to be as reliable as possible in predicting {RainTomorrow = “Yes”}. Since RainTomorrow = “Yes” is perceived as the prediction of a potential threat of damage due to the rainfall, we have to alert Canberra's citizens properly. That translates into having a very good specificity, as explained in the remainder of the analysis.
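As a reminder of the metric involved, specificity is the fraction of negative cases correctly predicted as negative. A toy computation with hypothetical labels (not the weather data):

```r
# specificity = true negatives / (true negatives + false positives)
actual    <- c("No", "No", "No",  "No", "Yes", "Yes", "Yes", "No")
predicted <- c("No", "No", "Yes", "No", "Yes", "Yes", "No",  "No")

tn <- sum(predicted == "No"  & actual == "No")
fp <- sum(predicted == "Yes" & actual == "No")
tn / (tn + fp)  # 0.8
```

Note that in the caret confusion matrices shown later, the “positive” class is “No”, so the reported specificity measures how many of the “Yes” (rainy) cases are captured.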

- Can we build additional models in order to predict further variables such as tomorrow’s Rainfall, Humidity3pm, WindGustSpeed, Sunshine, MinTemp and MaxTemp variables?

That is motivated by the fact that weather forecast comprises more than one prediction. For an example of the most typical ones, see:

Since Rainfall was one of the variables we dropped from our original dataset, we have to bring it back and identify the records associated with moderate rainfall. The following note applies.

In the remainder of the analysis, we will not assume that we have to implement separate 9AM, 3PM and late-evening predictions as done in part #2. For simplicity and brevity, we assume that we address these new goals in the late evening, when all predictors are available. As before, we will not take into account the following variables:

Date, Location, RISK_MM, RainToday, Rainfall, WindDir9am, WindDir3pm

as Date and Location cannot be used as predictors; RISK_MM, RainToday and Rainfall are excluded to make our task a little bit more difficult; and we showed in the first tutorial of this series that WindDir9am and WindDir3pm are not interesting predictors and have a considerable number of associated NA values.

Having said that, we then start the moderate rainfall scenario analysis. At the same time, we prepare the dataset for the rest of the analysis.

```r
weather_data6 <- subset(weather_data, select = -c(Date, Location, RISK_MM, RainToday, WindDir9am, WindDir3pm))
weather_data6$RainfallTomorrow <- c(weather_data6$Rainfall[2:nrow(weather_data6)], NA)
weather_data6$Humidity3pmTomorrow <- c(weather_data6$Humidity3pm[2:nrow(weather_data6)], NA)
weather_data6$WindGustSpeedTomorrow <- c(weather_data6$WindGustSpeed[2:nrow(weather_data6)], NA)
weather_data6$SunshineTomorrow <- c(weather_data6$Sunshine[2:nrow(weather_data6)], NA)
weather_data6$MinTempTomorrow <- c(weather_data6$MinTemp[2:nrow(weather_data6)], NA)
weather_data6$MaxTempTomorrow <- c(weather_data6$MaxTemp[2:nrow(weather_data6)], NA)
weather_data7 = weather_data6[complete.cases(weather_data6),]
head(weather_data7)
```

```
  MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindSpeed9am WindSpeed3pm
1     8.0    24.3      0.0         3.4      6.3          NW            30            6           20
2    14.0    26.9      3.6         4.4      9.7         ENE            39            4           17
3    13.7    23.4      3.6         5.8      3.3          NW            85            6            6
4    13.3    15.5     39.8         7.2      9.1          NW            54           30           24
5     7.6    16.1      2.8         5.6     10.6         SSE            50           20           28
6     6.2    16.9      0.0         5.8      8.2          SE            44           20           24
  Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainTomorrow
1          68          29      1019.7      1015.0        7        7    14.4    23.6          Yes
2          80          36      1012.4      1008.4        5        3    17.5    25.7          Yes
3          82          69      1009.5      1007.2        8        7    15.4    20.2          Yes
4          62          56      1005.5      1007.0        2        7    13.5    14.1          Yes
5          68          49      1018.3      1018.5        7        7    11.1    15.4           No
6          70          57      1023.8      1021.7        7        5    10.9    14.8           No
  RainfallTomorrow Humidity3pmTomorrow WindGustSpeedTomorrow SunshineTomorrow MinTempTomorrow
1              3.6                  36                    39              9.7            14.0
2              3.6                  69                    85              3.3            13.7
3             39.8                  56                    54              9.1            13.3
4              2.8                  49                    50             10.6             7.6
5              0.0                  57                    44              8.2             6.2
6              0.2                  47                    43              8.4             6.1
  MaxTempTomorrow
1            26.9
2            23.4
3            15.5
4            16.1
5            16.9
6            18.2
```

```r
hr_idx = which(weather_data7$RainfallTomorrow > 15)
```

It is important to understand how the moderate rainfall records are split between the training and testing sets, as shown here.

```r
(train_hr <- hr_idx[hr_idx %in% train_rec])
```

```
[1]   9  22  30  52  79 143 311
```

```r
(test_hr <- hr_idx[!(hr_idx %in% train_rec)])
```

```
[1]   3  99 239 251
```

We see that some of the “at least moderate” Rainfall records belong to the testing dataset, hence we can use them to obtain a measure based on data unseen by our model. We test the evening models with a test set comprising such moderate rainfall records.

```r
rain_test <- weather_data7[test_hr,]
rain_test
```

```
    MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindSpeed9am WindSpeed3pm
3      13.7    23.4      3.6         5.8      3.3          NW            85            6            6
99     14.5    24.2      0.0         6.8      5.9         SSW            61           11           20
250     3.0    11.1      0.8         1.4      0.2           W            35            0           13
263    -1.1    11.0      0.2         1.8      0.0         WNW            41            0            6
    Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainTomorrow
3            82          69      1009.5      1007.2        8        7    15.4    20.2          Yes
99           76          76       999.4       998.9        7        7    17.9    20.3          Yes
250          99          96      1024.4      1021.1        7        8     6.3     8.6          Yes
263          92          87      1014.4      1009.0        7        8     2.4     8.7          Yes
    RainfallTomorrow Humidity3pmTomorrow WindGustSpeedTomorrow SunshineTomorrow MinTempTomorrow
3               39.8                  56                    54              9.1            13.3
99              16.2                  58                    41              5.6            12.4
250             16.8                  72                    35              6.5             2.9
263             19.2                  56                    54              7.5             2.3
    MaxTempTomorrow
3              15.5
99             19.9
250             9.5
263            11.6
```

Let us see how the first weather forecast evening model performs.

```r
opt_cutoff <- 0.42
pred_test <- predict(mod_ev_c2_fit, rain_test, type="prob")
prediction <- ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No")
data.frame(prediction = prediction, RainfallTomorrow = rain_test$RainfallTomorrow)
```

```
  prediction RainfallTomorrow
1        Yes             39.8
2         No             16.2
3        Yes             16.8
4        Yes             19.2
```

```r
confusionMatrix(prediction, rain_test$RainTomorrow)
```

```
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No   0   1
       Yes  0   3

               Accuracy : 0.75
                 95% CI : (0.1941, 0.9937)
    No Information Rate : 1
    P-Value [Acc > NIR] : 1

                  Kappa : 0
 Mcnemar's Test P-Value : 1

            Sensitivity : NA
            Specificity : 0.75
         Pos Pred Value : NA
         Neg Pred Value : NA
             Prevalence : 0.00
         Detection Rate : 0.00
   Detection Prevalence : 0.25
      Balanced Accuracy : NA

       'Positive' Class : No
```

Then, the second evening weather forecast model.

```r
opt_cutoff <- 0.56
pred_test <- predict(mod_ev_c3_fit, rain_test, type="prob")
prediction <- ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No")
data.frame(prediction = prediction, RainfallTomorrow = rain_test$RainfallTomorrow)
```

```
  prediction RainfallTomorrow
1        Yes             39.8
2        Yes             16.2
3         No             16.8
4        Yes             19.2
```

```r
confusionMatrix(prediction, rain_test$RainTomorrow)
```

```
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No   0   1
       Yes  0   3

               Accuracy : 0.75
                 95% CI : (0.1941, 0.9937)
    No Information Rate : 1
    P-Value [Acc > NIR] : 1

                  Kappa : 0
 Mcnemar's Test P-Value : 1

            Sensitivity : NA
            Specificity : 0.75
         Pos Pred Value : NA
         Neg Pred Value : NA
             Prevalence : 0.00
         Detection Rate : 0.00
   Detection Prevalence : 0.25
      Balanced Accuracy : NA

       'Positive' Class : No
```

For both evening forecast models, one of the testing set predictions is wrong. If we would like to improve this, we have to step back to the tuning procedure and determine a decision threshold more suitable to capture such scenarios. From the tables shown in the previous part, we can try to lower the cutoff value to increase specificity, though likely at the cost of reduced accuracy.
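The effect of the cutoff can be explored with a quick sweep. The sketch below uses hypothetical predicted probabilities and true labels, standing in for `pred_test$Yes` and `rain_test$RainTomorrow`; it shows how lowering the cutoff captures more of the rainy days:

```r
# hypothetical predicted probabilities of "Yes" and the true labels
probs <- c(0.9, 0.55, 0.48, 0.35, 0.2, 0.6, 0.3, 0.7)
truth <- c("Yes", "Yes", "Yes", "No", "No", "Yes", "No", "Yes")

# fraction of actual rainy days flagged as "Yes" at a given cutoff
rainy_days_caught <- function(cutoff) {
  pred <- ifelse(probs >= cutoff, "Yes", "No")
  sum(pred == "Yes" & truth == "Yes") / sum(truth == "Yes")
}

sapply(c(0.56, 0.50, 0.42), rainy_days_caught)  # lower cutoffs catch more rainy days
```

In a real tuning step the cutoff would be chosen on the validation data, trading this gain against the false alarms introduced on dry days.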

In the previous part of this series, when discussing the tuning of the decision threshold, we dealt with probabilities associated with the predicted RainTomorrow variable. The chances of having RainTomorrow == “Yes” are equivalent to the chance of rain. Hence the following utility function can be defined for this purpose.

```r
chance_of_rain <- function(model, data_record){
  # note: use the model passed in as argument, not a hard-coded fit
  chance_frac <- predict(model, data_record, type="prob")[, "Yes"]
  paste(round(chance_frac*100), "%", sep="")
}
```

We try it out by passing ten records of the testing dataset.

```r
chance_of_rain(mod_ev_c3_fit, testing[1:10,])
```

```
 [1] "79%" "3%"  "4%"  "3%"  "1%"  "7%"  "69%" "61%" "75%" "78%"
```

To build all the following models, we are going to use all the available data in order to capture the variability of an entire year. For brevity, we do not compare models for the same predicted variable, nor do we show regression diagnostic plots.

If the logistic regression model predicts RainTomorrow = “Yes”, we would like to take advantage of a linear regression model capable of predicting tomorrow's Rainfall value. In other words, we are only interested in records whose Rainfall outcome is greater than 1 mm.

```r
weather_data8 = weather_data7[weather_data7$RainfallTomorrow > 1,]
rf_fit <- lm(RainfallTomorrow ~ MaxTemp + Sunshine + WindGustSpeed - 1, data = weather_data8)
summary(rf_fit)
```

```
Call:
lm(formula = RainfallTomorrow ~ MaxTemp + Sunshine + WindGustSpeed -
    1, data = weather_data8)

Residuals:
    Min      1Q  Median      3Q     Max
-10.363  -4.029  -1.391   3.696  23.627

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
MaxTemp         0.3249     0.1017   3.196 0.002225 **
Sunshine       -1.2515     0.2764  -4.528 2.88e-05 ***
WindGustSpeed   0.1494     0.0398   3.755 0.000394 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.395 on 60 degrees of freedom
Multiple R-squared:  0.6511, Adjusted R-squared:  0.6336
F-statistic: 37.32 on 3 and 60 DF,  p-value: 9.764e-14
```

All predictors are reported as significant.

```r
lm_pred <- predict(rf_fit, weather_data8)
plot(x = seq_along(weather_data8$RainfallTomorrow), y = weather_data8$RainfallTomorrow,
     type='p', xlab = "observations", ylab = "RainfallTomorrow")
legend("topright", c("actual", "predicted"), fill = c("black", "red"))
points(x = seq_along(weather_data8$RainfallTomorrow), y = lm_pred, col='red')
```

The way we intend to use this model together with the logistic regression model predicting the RainTomorrow factor variable is the following.

```
+----------------------+                              +--------------------+
| logistic regression  |                              | linear regression  |
| model                +--> {RainTomorrow = Yes} -->--+ model for          +----> {RainfallTomorrow prediction}
|                      |                              | RainfallTomorrow   |
+--------+-------------+                              +--------------------+
         |
         +---------------- {RainTomorrow = No} -->-- {Rainfall < 1 mm}
```

```r
h3pm_fit <- lm(Humidity3pmTomorrow ~ Humidity3pm + Sunshine, data = weather_data7)
summary(h3pm_fit)
```

```
Call:
lm(formula = Humidity3pmTomorrow ~ Humidity3pm + Sunshine, data = weather_data7)

Residuals:
    Min      1Q  Median      3Q     Max
-30.189  -9.117  -2.417   7.367  45.725

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.18668    5.45346   6.269 1.09e-09 ***
Humidity3pm  0.37697    0.06973   5.406 1.21e-07 ***
Sunshine    -0.85027    0.33476  -2.540   0.0115 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.26 on 344 degrees of freedom
Multiple R-squared:  0.2803, Adjusted R-squared:  0.2761
F-statistic: 66.99 on 2 and 344 DF,  p-value: < 2.2e-16
```

All predictors are reported as significant.

```r
lm_pred <- predict(h3pm_fit, weather_data7)
plot(x = seq_along(weather_data7$Humidity3pmTomorrow), y = weather_data7$Humidity3pmTomorrow,
     type='p', xlab = "observations", ylab = "Humidity3pmTomorrow")
legend("topright", c("actual", "predicted"), fill = c("black", "red"))
points(x = seq_along(weather_data7$Humidity3pmTomorrow), y = lm_pred, col='red')
```

```r
wgs_fit <- lm(WindGustSpeedTomorrow ~ WindGustSpeed + Pressure9am + Pressure3pm, data = weather_data7)
summary(wgs_fit)
```

```
Call:
lm(formula = WindGustSpeedTomorrow ~ Pressure9am + Pressure3pm +
    WindGustSpeed, data = weather_data7)

Residuals:
    Min      1Q  Median      3Q     Max
-31.252  -7.603  -1.069   6.155  49.984

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   620.53939  114.83812   5.404 1.22e-07 ***
Pressure9am     2.17983    0.36279   6.009 4.78e-09 ***
Pressure3pm    -2.76596    0.37222  -7.431 8.66e-13 ***
WindGustSpeed   0.23365    0.05574   4.192 3.52e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.29 on 343 degrees of freedom
Multiple R-squared:  0.2628, Adjusted R-squared:  0.2563
F-statistic: 40.75 on 3 and 343 DF,  p-value: < 2.2e-16
```

All predictors are reported as significant.

```r
lm_pred <- predict(wgs_fit, weather_data7)
plot(x = seq_along(weather_data7$WindGustSpeedTomorrow), y = weather_data7$WindGustSpeedTomorrow,
     type='p', xlab = "observations", ylab = "WindGustSpeedTomorrow")
legend("topright", c("actual", "predicted"), fill = c("black", "red"))
points(x = seq_along(weather_data7$WindGustSpeedTomorrow), y = lm_pred, col='red')
```

```r
sun_fit <- lm(SunshineTomorrow ~ Sunshine*Humidity3pm + Cloud3pm + Evaporation +
              I(Evaporation^2) + WindGustSpeed - 1, data = weather_data7)
summary(sun_fit)
```

```
Call:
lm(formula = SunshineTomorrow ~ Sunshine * Humidity3pm + Cloud3pm +
    Evaporation + I(Evaporation^2) + WindGustSpeed - 1, data = weather_data7)

Residuals:
    Min      1Q  Median      3Q     Max
-9.9701 -1.8230  0.6534  2.1907  7.0478

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
Sunshine              0.607984   0.098351   6.182 1.82e-09 ***
Humidity3pm           0.062289   0.012307   5.061 6.84e-07 ***
Cloud3pm             -0.178190   0.082520  -2.159 0.031522 *
Evaporation           0.738127   0.245356   3.008 0.002822 **
I(Evaporation^2)     -0.050735   0.020510  -2.474 0.013862 *
WindGustSpeed         0.036675   0.013624   2.692 0.007453 **
Sunshine:Humidity3pm -0.007704   0.002260  -3.409 0.000729 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.134 on 340 degrees of freedom
Multiple R-squared:  0.8718, Adjusted R-squared:  0.8692
F-statistic: 330.3 on 7 and 340 DF,  p-value: < 2.2e-16
```

All predictors are reported as significant, with a very good adjusted R-squared.

```r
lm_pred <- predict(sun_fit, weather_data7)
plot(x = seq_along(weather_data7$SunshineTomorrow), y = weather_data7$SunshineTomorrow,
     type='p', xlab = "observations", ylab = "SunshineTomorrow")
legend("topright", c("actual", "predicted"), fill = c("black", "red"))
points(x = seq_along(weather_data7$SunshineTomorrow), y = lm_pred, col='red')
```

The above plot makes it more evident that this model captures only a subset of the original Sunshine variance.

To give a more intuitive interpretation of the Sunshine information, it is advisable to report a factor variable having levels:

{"Cloudy", "Mostly Cloudy", "Partly Cloudy", "Mostly Sunny", "Sunny"}

In doing that, we furthermore have to take into account tomorrow's Cloud9am and Cloud3pm. For those quantitative variables, corresponding predictions are needed, and for this purpose the following linear regression models based on the Sunshine predictor can be defined.

```r
cloud9am_fit <- lm(Cloud9am ~ Sunshine, data = weather_data7)
summary(cloud9am_fit)
```

```
Call:
lm(formula = Cloud9am ~ Sunshine, data = weather_data7)

Residuals:
    Min      1Q  Median      3Q     Max
-5.2521 -1.7635 -0.4572  1.6920  5.9127

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.51527    0.28585   29.79   <2e-16 ***
Sunshine    -0.58031    0.03298  -17.59   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.163 on 345 degrees of freedom
Multiple R-squared:  0.4729, Adjusted R-squared:  0.4714
F-statistic: 309.6 on 1 and 345 DF,  p-value: < 2.2e-16
```

All predictors are reported as significant.

```r
lm_pred <- round(predict(cloud9am_fit, weather_data7))
lm_pred[lm_pred == 9] = 8
plot(x = weather_data7$Sunshine, y = weather_data7$Cloud9am, type='p',
     xlab = "Sunshine", ylab = "Cloud9am")
legend("topright", c("actual", "predicted"), fill = c("black", "red"))
points(x = weather_data7$Sunshine, y = lm_pred, col='red')
```

```r
cloud3pm_fit <- lm(Cloud3pm ~ Sunshine, data = weather_data7)
summary(cloud3pm_fit)
```

```
Call:
lm(formula = Cloud3pm ~ Sunshine, data = weather_data7)

Residuals:
    Min      1Q  Median      3Q     Max
-4.1895 -1.6230 -0.0543  1.4829  5.2883

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.01199    0.26382   30.37   <2e-16 ***
Sunshine    -0.50402    0.03044  -16.56   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.996 on 345 degrees of freedom
Multiple R-squared:  0.4428, Adjusted R-squared:  0.4412
F-statistic: 274.2 on 1 and 345 DF,  p-value: < 2.2e-16
```

All predictors are reported as significant.

```r
lm_pred <- round(predict(cloud3pm_fit, weather_data7))
lm_pred[lm_pred == 9] = 8
plot(x = weather_data7$Sunshine, y = weather_data7$Cloud3pm, type='p',
     xlab = "Sunshine", ylab = "Cloud3pm")
legend("topright", c("actual", "predicted"), fill = c("black", "red"))
points(x = weather_data7$Sunshine, y = lm_pred, col='red')
```

Once the SunshineTomorrow value is predicted, we can use it as predictor input in the Cloud9am/Cloud3pm linear regression models to determine both tomorrow's Cloud9am and Cloud3pm predictions. We have preferred this approach over building a model based on one of the following formulas:

```r
Cloud9amTomorrow ~ Cloud9am
Cloud9amTomorrow ~ Cloud9am + Sunshine
Cloud3pmTomorrow ~ Cloud3pm
```

because they yield a very low adjusted R-squared (approximately in the interval [0.07, 0.08]).

Based on the Cloud9am and Cloud3pm predictions, the computation of a new factor variable named CloudConditions, with levels {“Cloudy”, “Mostly Cloudy”, “Partly Cloudy”, “Mostly Sunny”, “Sunny”}, will be shown. As we observed before, our Sunshine prediction captures only a subset of the variance of the modeled variable; as a consequence, our cloud conditions prediction is expected to rarely (possibly never) report “Cloudy” and “Sunny”.

```r
minTemp_fit <- lm(MinTempTomorrow ~ MinTemp + Humidity3pm, data = weather_data7)
summary(minTemp_fit)
```

```
Call:
lm(formula = MinTempTomorrow ~ MinTemp + Humidity3pm, data = weather_data7)

Residuals:
     Min       1Q   Median       3Q      Max
-10.4820  -1.7546   0.3573   1.9119  10.0953

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.696230   0.520244   5.183 3.74e-07 ***
MinTemp      0.844813   0.027823  30.364  < 2e-16 ***
Humidity3pm -0.034404   0.009893  -3.478 0.000571 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.112 on 344 degrees of freedom
Multiple R-squared:  0.7327, Adjusted R-squared:  0.7311
F-statistic: 471.4 on 2 and 344 DF,  p-value: < 2.2e-16
```

All predictors are reported as significant.

```r
lm_pred <- predict(minTemp_fit, weather_data7)
plot(x = weather_data7$Sunshine, y = weather_data7$MinTemp, type='p',
     xlab = "Sunshine", ylab = "MinTemp")
legend("topright", c("actual", "fitted"), fill = c("black", "red"))
points(x = weather_data7$Sunshine, y = lm_pred, col='red')
```

```r
maxTemp_fit <- lm(MaxTempTomorrow ~ MaxTemp + Evaporation, data = weather_data7)
summary(maxTemp_fit)
```

```
Call:
lm(formula = MaxTempTomorrow ~ MaxTemp + Evaporation, data = weather_data7)

Residuals:
    Min      1Q  Median      3Q     Max
-13.217  -1.609   0.137   2.025   9.708

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.69811    0.56818   4.749 3.01e-06 ***
MaxTemp      0.82820    0.03545  23.363  < 2e-16 ***
Evaporation  0.20253    0.08917   2.271   0.0237 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.231 on 344 degrees of freedom
Multiple R-squared:  0.7731, Adjusted R-squared:  0.7718
F-statistic: 586 on 2 and 344 DF,  p-value: < 2.2e-16
```

All predictors are reported as significant.

```r
lm_pred <- predict(maxTemp_fit, weather_data7)
plot(x = weather_data7$Sunshine, y = weather_data7$MaxTemp, type='p',
     xlab = "Sunshine", ylab = "MaxTemp")
legend("topright", c("actual", "fitted"), fill = c("black", "red"))
points(x = weather_data7$Sunshine, y = lm_pred, col='red')
```

Based on the second reference given at the end, we have the following mapping between a descriptive cloud conditions string and the cloud coverage in “oktas”, a unit of eighths.

```
Cloudy:        8/8 opaque clouds
Mostly Cloudy: 6/8 - 7/8 opaque clouds
Partly Cloudy: 3/8 - 5/8 opaque clouds
Mostly Sunny:  1/8 - 2/8 opaque clouds
Sunny:         0/8 opaque clouds
```

We can define the following utility function to help.

```r
computeCloudConditions = function(cloud_9am, cloud_3pm) {
  cloud_avg = min(round((cloud_9am + cloud_3pm)/2), 8)
  cc_str = NULL
  if (cloud_avg == 8) {
    cc_str = "Cloudy"
  } else if (cloud_avg >= 6) {
    cc_str = "Mostly Cloudy"
  } else if (cloud_avg >= 3) {
    cc_str = "Partly Cloudy"
  } else if (cloud_avg >= 1) {
    cc_str = "Mostly Sunny"
  } else if (cloud_avg < 1) {
    cc_str = "Sunny"
  }
  cc_str
}
```

Starting from a basic example of a weather dataset, we were able to build several regression models. The first one, based on logistic regression, is capable of predicting the RainTomorrow factor variable. The linear regression models predict the Rainfall, Humidity3pm, WindGustSpeed, MinTemp, MaxTemp and CloudConditions weather metrics. The overall picture we got is the following.

```
                            +----------------+
                            |                +------> RainTomorrow
                            |                |
                            |    Weather     +------> Rainfall
                            |                |
{Today's weather data} ->---+    Forecast    +------> Humidity3pm
                            |                |
                            |    Models      +------> WindGustSpeed
                            |                |                        +-----------------------+
                            |                +------> Sunshine --->---+  linear regression    +-->-- {Cloud9am, Cloud3pm} --+
                            |                |                        |      models for       |                            |
                            |                +------> MinTemp         | Cloud9am and Cloud3pm |                            |
                            |                |                        +-----------------------+                            |
                            |                +------> MaxTemp                                                              |
                            +----------------+                        +-----------------------+                            |
                                                                      |    computing new      |                            |
{Cloudy, Mostly Cloudy, Partly Cloudy, Mostly Sunny, Sunny} <---------+   factor variable     +------------<---------------+
                                                                      |   CloudConditions     |
                                                                      +-----------------------+
```

As said before, if RainTomorrow == “Yes”, Rainfall is explicitly given; otherwise an upper bound, Rainfall < 1 mm, is reported. The chance of rain is computed only if the RainTomorrow prediction is “Yes”. The Humidity3pm prediction is taken as the humidity prediction for the whole day, in general.

```r
weather_report <- function(today_record, rain_tomorrow_model, cutoff) {
  # RainTomorrow prediction
  rainTomorrow_prob <- predict(rain_tomorrow_model, today_record, type="prob")
  rainTomorrow_pred = ifelse(rainTomorrow_prob$Yes >= cutoff, "Yes", "No")

  # Rainfall prediction iff RainTomorrow prediction is Yes; chance of rain probability
  rainfall_pred <- NA
  chance_of_rain <- NA
  if (rainTomorrow_pred == "Yes") {
    rainfall_pred <- round(predict(rf_fit, today_record), 1)
    chance_of_rain <- round(rainTomorrow_prob$Yes*100)
  }

  # WindGustSpeed prediction
  wgs_pred <- round(predict(wgs_fit, today_record), 1)

  # Humidity3pm prediction
  h3pm_pred <- round(predict(h3pm_fit, today_record), 1)

  # sunshine prediction is used to fit Cloud9am and Cloud3pm
  sun_pred <- predict(sun_fit, today_record)
  cloud9am_pred <- min(round(predict(cloud9am_fit, data.frame(Sunshine=sun_pred))), 8)
  cloud3pm_pred <- min(round(predict(cloud3pm_fit, data.frame(Sunshine=sun_pred))), 8)

  # a descriptive cloud conditions string is computed
  CloudConditions_pred <- computeCloudConditions(cloud9am_pred, cloud3pm_pred)

  # MinTemp prediction
  minTemp_pred <- round(predict(minTemp_fit, today_record), 1)

  # MaxTemp prediction
  maxTemp_pred <- round(predict(maxTemp_fit, today_record), 1)

  # converting all numeric predictions to strings
  if (is.na(rainfall_pred)) {
    rainfall_pred_str <- "< 1 mm"
  } else {
    rainfall_pred_str <- paste(rainfall_pred, "mm", sep = " ")
  }
  if (is.na(chance_of_rain)) {
    chance_of_rain_str <- ""
  } else {
    chance_of_rain_str <- paste(chance_of_rain, "%", sep="")
  }
  wgs_pred_str <- paste(wgs_pred, "Km/h", sep = " ")
  h3pm_pred_str <- paste(h3pm_pred, "%", sep = "")
  minTemp_pred_str <- paste(minTemp_pred, "°C", sep = "")
  maxTemp_pred_str <- paste(maxTemp_pred, "°C", sep = "")

  report <- data.frame(Rainfall = rainfall_pred_str,
                       ChanceOfRain = chance_of_rain_str,
                       WindGustSpeed = wgs_pred_str,
                       Humidity = h3pm_pred_str,
                       CloudConditions = CloudConditions_pred,
                       MinTemp = minTemp_pred_str,
                       MaxTemp = maxTemp_pred_str)
  report
}
```

Of course, there are confidence and prediction intervals associated with our predictions. However, since we intend to report our forecasts to Canberra's citizens, our message should be put in simple words to reach everybody, not just statisticians.

Finally, we can try out our tomorrow weather forecast report.

```r
(tomorrow_report <- weather_report(weather_data[10,], mod_ev_c3_fit, 0.56))
```

```
  Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp
1   < 1 mm                  36.9 Km/h    39.7%   Partly Cloudy   8.7°C  22.7°C
```

```r
(tomorrow_report <- weather_report(weather_data[32,], mod_ev_c3_fit, 0.56))
```

```
  Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp
1   < 1 mm                    48 Km/h    41.3%   Partly Cloudy  10.9°C  24.8°C
```

```r
(tomorrow_report <- weather_report(weather_data[50,], mod_ev_c3_fit, 0.56))
```

```
  Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp
1   8.3 mm          59%     36.9 Km/h    60.1%   Partly Cloudy  10.8°C  22.2°C
```

```r
(tomorrow_report <- weather_report(weather_data[100,], mod_ev_c3_fit, 0.56))
```

```
  Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp
1   5.4 mm          71%     46.7 Km/h    51.3%   Partly Cloudy  11.2°C  20.3°C
```

```r
(tomorrow_report <- weather_report(weather_data[115,], mod_ev_c3_fit, 0.56))
```

```
  Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp
1   < 1 mm                  46.4 Km/h    35.7%    Mostly Sunny  12.3°C  25.1°C
```

```r
(tomorrow_report <- weather_report(weather_data[253,], mod_ev_c3_fit, 0.56))
```

```
  Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp
1   9.4 mm          81%     52.4 Km/h    65.2%   Partly Cloudy   1.3°C  10.3°C
```

```r
(tomorrow_report <- weather_report(weather_data[311,], mod_ev_c3_fit, 0.56))
```

```
  Rainfall ChanceOfRain WindGustSpeed Humidity CloudConditions MinTemp MaxTemp
1   < 1 mm                  40.7 Km/h    44.2%   Mostly Cloudy   6.9°C  16.4°C
```

Please note that some records of the original weather dataset may contain NA values, and our regression models do not handle missing predictors. To definitively evaluate the accuracy of our weather forecast report, we would need to check its predictions on unseen data against the weather that actually occurred the next day. Furthermore, improving the adjusted R-squared metric of our linear regression models is a potential area of investigation.

Starting from a basic weather dataset, we went through an interesting data story involving exploratory analysis and model building. We spotted strengths and weaknesses of our prediction models. We also picked up some meteorological terminology along the way.

We hope you enjoyed this tutorial series on weather forecast. If you have any questions, please feel free to comment below.

**References**


It took me some time to figure out the full process from A to Z. I was often left with remaining questions:

- What to do with the import of external libraries?
- What about documentation?
- How to actually install my package?
- How can I share it with others?

Therefore, I will explain how I made my first R package and which methods I found helpful. Of course, there are different approaches out there, some of them very well documented, like for example this one – but the following *3 easy steps* worked for me, so maybe they will help you, too, to get your first R package ready, installed and online quickly.

My first package is a wrapper for the FlightStats API. You can sign up for FlightStats to get a free trial API key (which unfortunately works for one month only). However, you can of course make a wrapper for any API you like or make a non-API related package. Two links I found very useful for getting started making my first package are this one and this one (regarding the storing of API keys). Also, I learned a lot using this package as an example.

First of all, we need to have all the functions (and possibly dataframes & other objects) ready in our R environment. For example, imagine we want to include these two functions to store the API key and an app ID:

setAPIKey <- function(){
  input = readline(prompt="Enter your FlightStats API Key: ")
  # A simple way of storing API keys: Sys.setenv() only sets the variable for
  # the current session, so the login details will have to be provided again
  # next session. See below how to store login details in a more durable way.
  Sys.setenv(flightstats_api_key = input)
}

setAppId <- function(){
  input = readline(prompt="Enter your FlightStats appID: ")
  Sys.setenv(flightstats_app_id = input)
}

Now you can retrieve the login in details like this:

# if the key does not exist, this returns an empty string (""); in that case
# the user should be prompted to use the setAPIKey() and setAppId() functions
ID = Sys.getenv("flightstats_app_id")
KEY = Sys.getenv("flightstats_api_key")

However, if you would like the login details to still be accessible the next time you open R, you can define your login details in the *.Renviron* file (instead of the *.Rprofile* file). The *.Renviron* file can be found in your home directory. You can find the path to your home directory by running `Sys.getenv("HOME")`. For more info see here. The final function to store the API key in the FlightsR package looked like this.
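The FlightsR function itself is linked above; purely as an illustrative sketch of the .Renviron approach (a hypothetical helper of my own, with the file path made a parameter), it could look like this:

```r
# Hypothetical helper: persist the API key by appending a line to .Renviron,
# so Sys.getenv("flightstats_api_key") finds it in future sessions
setAPIKeyPermanent <- function(key,
                               path = file.path(Sys.getenv("HOME"), ".Renviron")) {
  cat(sprintf("flightstats_api_key=%s\n", key), file = path, append = TRUE)
}
```

After restarting R, `Sys.getenv("flightstats_api_key")` would then return the stored key.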

Side note: I actually did not know this before, but find it quite handy: if you want to see the source code of any function in R, just type the function name and hit enter. For example:

> read.csv
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
    fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
    dec = dec, fill = fill, comment.char = comment.char, ...)
<bytecode: 0x029e046c>
<environment: namespace:utils>

Let’s make another function that we can add to the package. For example, here is a simple function to list all airlines:

listAirlines <- function(activeOnly=TRUE){
  ID = Sys.getenv("flightstats_app_id")
  KEY = Sys.getenv("flightstats_api_key")
  if (ID == "" | KEY == ""){
    stop("Please set your FlightStats AppID and API Key with the setAPIKey() and setAppId() functions. You can obtain these from https://developer.flightstats.com.")
  }
  if (activeOnly == FALSE) {
    choice = "all"
  } else {
    choice = "active"
  }
  link = paste0("https://api.flightstats.com/flex/airlines/rest/v1/json/", choice, "?appId=", ID, "&appKey=", KEY)
  dat = getURL(link)
  dat_list <- fromJSON(dat)
  airlines <- dat_list$airlines
  return(airlines)
}

Use `package.skeleton()`, devtools & roxygen2 to let R prepare the necessary documents for you:

package.skeleton(name = "FlightR",
                 list = c("listAirlines", "listAirports", "scheduledFlights",
                          "scheduledFlightsFullDay", "searchAirline",
                          "searchAirport", "setAPIKey", "setAppId"))

That’s it! Now in your working directory folder there should be a new folder with the name you just gave to your package.

Now, what is handy about the function above is that it creates the folders and files you need in a new package folder (“FlightsR” in this case). In the `/R` folder you see now that every function you added has its own .R script, and in the `/man` folder there is an .Rd file for each of the functions.

You can now go and manually change everything in these files that needs to be changed (documentation needs to be added, the import of external packages to be defined, etc.) – or use roxygen2 and devtools to do it for you. Roxygen2 will complete the documentation in each .Rd file correctly and will create a *NAMESPACE* file for you. To do this, make sure you *delete the current incomplete files* (that is, all the files in the `/man` folder and the *NAMESPACE* file), otherwise you will get an error when you use the `document()` function later.

Now extra information needs to be added in the functions (for example, what are the *parameters* of the function, an *example* usage, necessary *imports*, etc.), in the following way:

#' Function searches a specific airline by IATA code
#'
#' @param value character, an airline IATA code
#' @return data.frame() with the airline
#'
#' @author Emelie Hofland, \email{emelie_hofland@hotmail.com}
#'
#' @examples
#' searchAirline("FR")
#'
#' @import RCurl
#' @import jsonlite
#' @export
#'
searchAirline <- function(value){
  ID = Sys.getenv("flightstats_app_id")
  KEY = Sys.getenv("flightstats_api_key")
  if (ID == "" | KEY == ""){
    stop("Please set your FlightStats AppID and API Key with the setAPIKey() and setAppId() functions. You can obtain these from https://developer.flightstats.com.")
  }
  link = paste0("https://api.flightstats.com/flex/airlines/rest/v1/json/iata/", toupper(value), "?appId=", ID, "&appKey=", KEY)
  dat <- getURL(link)
  dat_list <- fromJSON(dat)
  result <- dat_list$airlines
  if (length(result) == 0){
    warning("Please make sure that you provide a valid airline IATA code.")
  }
  return(result)
}

Do not forget to add the `@export` tag, otherwise your functions will not be there when you load your package!

Now, when you have added this information for all your functions, make sure that devtools and roxygen2 are installed.

install.packages("roxygen2")
install.packages("devtools")

Make sure your working directory is set to the folder of your package and run the following commands in R:

library(devtools)

# to automatically generate the documentation:
document()

# to build the package:
build()

# to install the package:
install()

*Voila!* You are done. I do not know if it is necessary, but just to be sure I restarted R at this point. In a new session you can now run `library(YourPackageName)` and this should work.

To adjust functions, you can just change things in the functions in the package and re-run the `document()`, `build()` and `install()` commands.

Note: These steps assume that you have Git installed and configured on your PC.

1) Create a new repository in your github account.

2) Create and copy the https link to clone the repo on your PC.

3) Go to the folder on your PC where you want to save your repo, open the command line interface & type:

$ git clone https://github.com/YourGithub/YourPackage.git

4) Copy all the files from your package in the folder and run:

$ git add .
$ git commit -m "whatever message you want to add"
$ git push origin master

5) *Voila* – now your package should be on GitHub!

Now people can download & install your package straight from GitHub or GitLab – the devtools package has a function for this:

if (!require(devtools)) { install.packages('devtools') }

# If your repo is on GitHub:
devtools::install_github('YourGithub/YourPackage')

# If your repo is a public repo on GitLab:
devtools::install_git("https://gitlab/link/to/your/repo")

# If your repo is a private repo on GitLab:
devtools::install_git("https://emelie.hofland:password@gitlab/link/to/your/repo.git")

Post a comment below if you have a question.


By default, caret uses 0.5 as the decision threshold, i.e. the probability cutoff value. We are going to show how to tune this decision threshold to achieve a better training set accuracy, and then compare how the testing set accuracy changes accordingly.
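To illustrate the idea before touching the real models, here is a tiny base-R sketch (toy probabilities and labels of my own, not from the weather dataset) showing how the predicted classes, and hence the accuracy, change as the cutoff moves:

```r
# Toy predicted probabilities for class "Yes" and the corresponding true labels
probs <- c(0.30, 0.45, 0.55, 0.70)
truth <- c("No", "Yes", "No", "Yes")

# The same probabilities yield different class labels, hence different
# accuracies, at different cutoff values
for (q in c(0.4, 0.5, 0.6)) {
  pred <- ifelse(probs >= q, "Yes", "No")
  cat(sprintf("cutoff %.1f -> accuracy %.2f\n", q, mean(pred == truth)))
}
```

Here cutoffs 0.4 and 0.6 both reach 0.75 accuracy while 0.5 only reaches 0.50, which is exactly the kind of effect the tuning procedure below searches for.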

We load previously saved training, testing datasets together with the models we have already determined.

suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(ROCR))

set.seed(1023)
readRDS(file="wf_log_reg_part2.rds")

The following tuning procedure explores a pool of cutoff values and returns a table of {cutoff, accuracy, specificity} triplets. Inspecting it allows us to identify the cutoff value providing the highest accuracy. In case of a tie in accuracy, we choose the lowest such cutoff value.

glm.tune <- function(glm_model, dataset) {
  results <- data.frame()
  for (q in seq(0.2, 0.8, by = 0.02)) {
    fitted_values <- glm_model$finalModel$fitted.values
    prediction <- ifelse(fitted_values >= q, "Yes", "No")
    cm <- confusionMatrix(prediction, dataset$RainTomorrow)
    accuracy <- cm$overall["Accuracy"]
    specificity <- cm$byClass["Specificity"]
    results <- rbind(results, data.frame(cutoff = q, accuracy = accuracy, specificity = specificity))
  }
  rownames(results) <- NULL
  results
}

In the utility function above, we also included the specificity metric. Specificity is defined as TN/(TN+FP), and we will discuss some aspects of it in the next part of this series. Let us show an example confusion matrix with some metric computations to clarify the terminology.

          Reference
Prediction  No Yes
       No  203  42
       Yes   0   3

       Accuracy : 0.8306  = (TP+TN)/(TP+TN+FP+FN) = (203+3)/(203+3+42+0)
    Sensitivity : 1.00000 = TP/(TP+FN) = 203/(203+0)
    Specificity : 0.06667 = TN/(TN+FP) = 3/(3+42)
 Pos Pred Value : 0.82857 = TP/(TP+FP) = 203/(203+42)
 Neg Pred Value : 1.00000 = TN/(TN+FN) = 3/(3+0)

Please note that the “No” class is associated with a positive outcome, whilst “Yes” with a negative one. Specificity therefore measures how good we are at predicting {RainTomorrow = “Yes”} when it actually rains the next day, and this is something that we are going to investigate in the next post of this series.
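As a base-R cross-check of the terminology (variable names here are mine), the metrics in the example above can be recomputed directly from the confusion matrix counts:

```r
# Confusion matrix from the example above, with "No" as the positive class
cm <- matrix(c(203, 0, 42, 3), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))
TP <- cm["No", "No"];  FN <- cm["Yes", "No"]
FP <- cm["No", "Yes"]; TN <- cm["Yes", "Yes"]

accuracy    <- (TP + TN) / sum(cm)  # (203+3)/248 = 0.8306
sensitivity <- TP / (TP + FN)       # 203/203    = 1
specificity <- TN / (TN + FP)       # 3/45       = 0.0667
```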

The logistic regression models to be tuned are:

mod9am_c1_fit: RainTomorrow ~ Cloud9am + Humidity9am + Pressure9am + Temp9am
mod3pm_c1_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm
mod_ev_c2_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDir
mod_ev_c3_fit: RainTomorrow ~ Pressure3pm + Sunshine

In the following analysis, we determine for each model the cutoff value that provides the highest accuracy.

glm.tune(mod9am_c1_fit, training)

   cutoff  accuracy specificity
1    0.20 0.7822581  0.73333333
2    0.22 0.7862903  0.71111111
3    0.24 0.7943548  0.68888889
4    0.26 0.8064516  0.64444444
5    0.28 0.8104839  0.64444444
6    0.30 0.8145161  0.60000000
7    0.32 0.8185484  0.57777778
8    0.34 0.8185484  0.51111111
9    0.36 0.8104839  0.46666667
10   0.38 0.8266129  0.46666667
11   0.40 0.8266129  0.40000000
12   0.42 0.8306452  0.40000000
13   0.44 0.8266129  0.37777778
14   0.46 0.8346774  0.35555556
15   0.48 0.8467742  0.33333333
16   0.50 0.8508065  0.31111111
17   0.52 0.8427419  0.26666667
18   0.54 0.8467742  0.26666667
19   0.56 0.8346774  0.20000000
20   0.58 0.8387097  0.20000000
21   0.60 0.8387097  0.20000000
22   0.62 0.8346774  0.17777778
23   0.64 0.8387097  0.17777778
24   0.66 0.8346774  0.15555556
25   0.68 0.8387097  0.15555556
26   0.70 0.8387097  0.13333333
27   0.72 0.8387097  0.13333333
28   0.74 0.8346774  0.11111111
29   0.76 0.8266129  0.06666667
30   0.78 0.8266129  0.06666667
31   0.80 0.8306452  0.06666667

The optimal cutoff value is equal to 0.5. In this case, we find the same cutoff value as the default that the caret package uses for logistic regression models.

opt_cutoff <- 0.5
pred_test <- predict(mod9am_c1_fit, testing, type = "prob")
prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No")
confusionMatrix(prediction, testing$RainTomorrow)

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  80  12
       Yes  6   7

               Accuracy : 0.8286
                 95% CI : (0.7427, 0.8951)
    No Information Rate : 0.819
    P-Value [Acc > NIR] : 0.4602
                  Kappa : 0.3405
 Mcnemar's Test P-Value : 0.2386
            Sensitivity : 0.9302
            Specificity : 0.3684
         Pos Pred Value : 0.8696
         Neg Pred Value : 0.5385
             Prevalence : 0.8190
         Detection Rate : 0.7619
   Detection Prevalence : 0.8762
      Balanced Accuracy : 0.6493
       'Positive' Class : No

Then, tuning the 3PM model.

glm.tune(mod3pm_c1_fit, training)

   cutoff  accuracy specificity
1    0.20 0.8064516   0.7777778
2    0.22 0.8185484   0.7555556
3    0.24 0.8225806   0.7333333
4    0.26 0.8306452   0.6888889
5    0.28 0.8467742   0.6666667
6    0.30 0.8467742   0.6444444
7    0.32 0.8427419   0.6222222
8    0.34 0.8669355   0.6222222
9    0.36 0.8709677   0.6222222
10   0.38 0.8629032   0.5777778
11   0.40 0.8669355   0.5777778
12   0.42 0.8669355   0.5555556
13   0.44 0.8548387   0.4666667
14   0.46 0.8548387   0.4444444
15   0.48 0.8588710   0.4444444
16   0.50 0.8669355   0.4444444
17   0.52 0.8629032   0.4222222
18   0.54 0.8669355   0.4222222
19   0.56 0.8669355   0.3777778
20   0.58 0.8669355   0.3777778
21   0.60 0.8588710   0.3333333
22   0.62 0.8548387   0.3111111
23   0.64 0.8508065   0.2888889
24   0.66 0.8467742   0.2666667
25   0.68 0.8387097   0.2222222
26   0.70 0.8387097   0.2222222
27   0.72 0.8346774   0.2000000
28   0.74 0.8387097   0.1777778
29   0.76 0.8346774   0.1555556
30   0.78 0.8346774   0.1555556
31   0.80 0.8306452   0.1111111

The optimal cutoff value is equal to 0.36.

opt_cutoff <- 0.36
pred_test <- predict(mod3pm_c1_fit, testing, type="prob")
prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No")
confusionMatrix(prediction, testing$RainTomorrow)

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  81   5
       Yes  5  14

               Accuracy : 0.9048
                 95% CI : (0.8318, 0.9534)
    No Information Rate : 0.819
    P-Value [Acc > NIR] : 0.0112
                  Kappa : 0.6787
 Mcnemar's Test P-Value : 1.0000
            Sensitivity : 0.9419
            Specificity : 0.7368
         Pos Pred Value : 0.9419
         Neg Pred Value : 0.7368
             Prevalence : 0.8190
         Detection Rate : 0.7714
   Detection Prevalence : 0.8190
      Balanced Accuracy : 0.8394
       'Positive' Class : No

We improved the test accuracy, previously equal to 0.8857 and now equal to 0.9048 (please take a look at part #2 of this tutorial for the test accuracy with the 0.5 cutoff value). Next, the first evening hours model.

glm.tune(mod_ev_c2_fit, training)

   cutoff  accuracy specificity
1    0.20 0.8387097   0.8444444
2    0.22 0.8467742   0.8222222
3    0.24 0.8548387   0.8222222
4    0.26 0.8629032   0.8000000
5    0.28 0.8588710   0.7333333
6    0.30 0.8709677   0.7333333
7    0.32 0.8790323   0.7111111
8    0.34 0.8830645   0.6888889
9    0.36 0.8870968   0.6666667
10   0.38 0.8870968   0.6666667
11   0.40 0.8951613   0.6666667
12   0.42 0.8991935   0.6666667
13   0.44 0.8991935   0.6666667
14   0.46 0.8951613   0.6444444
15   0.48 0.8951613   0.6222222
16   0.50 0.8951613   0.6222222
17   0.52 0.8951613   0.6000000
18   0.54 0.8911290   0.5777778
19   0.56 0.8911290   0.5777778
20   0.58 0.8951613   0.5777778
21   0.60 0.8911290   0.5555556
22   0.62 0.8870968   0.5111111
23   0.64 0.8830645   0.4888889
24   0.66 0.8830645   0.4666667
25   0.68 0.8790323   0.4222222
26   0.70 0.8709677   0.3555556
27   0.72 0.8548387   0.2666667
28   0.74 0.8508065   0.2444444
29   0.76 0.8427419   0.1777778
30   0.78 0.8387097   0.1555556
31   0.80 0.8387097   0.1555556

The optimal cutoff value is equal to 0.42.

opt_cutoff <- 0.42
pred_test <- predict(mod_ev_c2_fit, testing, type="prob")
prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No")
confusionMatrix(prediction, testing$RainTomorrow)

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  82   8
       Yes  4  11

               Accuracy : 0.8857
                 95% CI : (0.8089, 0.9395)
    No Information Rate : 0.819
    P-Value [Acc > NIR] : 0.04408
                  Kappa : 0.58
 Mcnemar's Test P-Value : 0.38648
            Sensitivity : 0.9535
            Specificity : 0.5789
         Pos Pred Value : 0.9111
         Neg Pred Value : 0.7333
             Prevalence : 0.8190
         Detection Rate : 0.7810
   Detection Prevalence : 0.8571
      Balanced Accuracy : 0.7662
       'Positive' Class : No

We did not improve the accuracy here, as the default cutoff value already provides the same. Other metrics changed; in particular, the sensitivity decreased a little while the positive predicted value improved. Then the second evening hours model.

glm.tune(mod_ev_c3_fit, training)

   cutoff  accuracy specificity
1    0.20 0.8225806   0.8000000
2    0.22 0.8145161   0.7333333
3    0.24 0.8064516   0.6666667
4    0.26 0.8145161   0.6666667
5    0.28 0.8145161   0.6222222
6    0.30 0.8185484   0.6222222
7    0.32 0.8266129   0.6000000
8    0.34 0.8185484   0.5555556
9    0.36 0.8306452   0.5333333
10   0.38 0.8467742   0.5333333
11   0.40 0.8467742   0.5111111
12   0.42 0.8427419   0.4666667
13   0.44 0.8508065   0.4666667
14   0.46 0.8467742   0.4222222
15   0.48 0.8467742   0.4000000
16   0.50 0.8508065   0.4000000
17   0.52 0.8548387   0.4000000
18   0.54 0.8508065   0.3777778
19   0.56 0.8669355   0.3777778
20   0.58 0.8669355   0.3555556
21   0.60 0.8629032   0.3333333
22   0.62 0.8588710   0.3111111
23   0.64 0.8548387   0.2888889
24   0.66 0.8508065   0.2666667
25   0.68 0.8467742   0.2444444
26   0.70 0.8387097   0.2000000
27   0.72 0.8387097   0.1777778
28   0.74 0.8346774   0.1555556
29   0.76 0.8306452   0.1333333
30   0.78 0.8306452   0.1333333
31   0.80 0.8306452   0.1333333

The optimal cutoff value is equal to 0.56.

opt_cutoff <- 0.56
pred_test <- predict(mod_ev_c3_fit, testing, type="prob")
prediction = ifelse(pred_test$Yes >= opt_cutoff, "Yes", "No")
confusionMatrix(prediction, testing$RainTomorrow)

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  82   8
       Yes  4  11

               Accuracy : 0.8857
                 95% CI : (0.8089, 0.9395)
    No Information Rate : 0.819
    P-Value [Acc > NIR] : 0.04408
                  Kappa : 0.58
 Mcnemar's Test P-Value : 0.38648
            Sensitivity : 0.9535
            Specificity : 0.5789
         Pos Pred Value : 0.9111
         Neg Pred Value : 0.7333
             Prevalence : 0.8190
         Detection Rate : 0.7810
   Detection Prevalence : 0.8571
      Balanced Accuracy : 0.7662
       'Positive' Class : No

In this case, we did not improve the accuracy of this model either. Other metrics changed; in particular, the sensitivity is slightly higher than the default-cutoff value (0.9419), while the positive predicted value decreased a little (it was 0.9205 for the default cutoff).

To gain a visual understanding of the classification performance and compute further metrics, we take advantage of the ROCR package. For each model, we plot the ROC curve and determine the corresponding AUC value. The function used to accomplish this task is the following.

glm.perf.plot <- function (prediction, cutoff) {
  perf <- performance(prediction, measure = "tpr", x.measure = "fpr")
  par(mfrow = (c(1, 2)))
  plot(perf, col = "red")
  grid()

  perf <- performance(prediction, measure = "acc", x.measure = "cutoff")
  plot(perf, col = "red")
  abline(v = cutoff, col = "green")
  grid()

  auc_res <- performance(prediction, "auc")
  auc_res@y.values[[1]]
}

In the following, each model will be considered, AUC value printed out and ROC plots given.

mod9am_pred_prob <- predict(mod9am_c1_fit, testing, type="prob")
mod9am_pred_resp <- prediction(mod9am_pred_prob$Yes, testing$RainTomorrow)
glm.perf.plot(mod9am_pred_resp, 0.5)

[1] 0.8004896

mod3pm_pred_prob <- predict(mod3pm_c1_fit, testing, type="prob")
mod3pm_pred_resp <- prediction(mod3pm_pred_prob$Yes, testing$RainTomorrow)
glm.perf.plot(mod3pm_pred_resp, 0.36)

[1] 0.9155447

The 3PM model shows the highest AUC value among all the models.

mod_ev_c2_prob <- predict(mod_ev_c2_fit, testing, type="prob")
mod_ev_c2_pred_resp <- prediction(mod_ev_c2_prob$Yes, testing$RainTomorrow)
glm.perf.plot(mod_ev_c2_pred_resp, 0.42)

[1] 0.8390453

mod_ev_c3_prob <- predict(mod_ev_c3_fit, testing, type="prob")
mod_ev_c3_pred_resp <- prediction(mod_ev_c3_prob$Yes, testing$RainTomorrow)
glm.perf.plot(mod_ev_c3_pred_resp, 0.56)

[1] 0.8886169
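As a sanity check on the AUC values reported above, the area under the ROC curve can also be computed in base R with the rank-sum (Wilcoxon) formula; `auc_manual` below is my own illustrative helper, not part of ROCR, shown here with toy scores:

```r
# AUC = probability that a randomly chosen positive case receives a higher
# score than a randomly chosen negative case (rank-sum formulation)
auc_manual <- function(scores, labels) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_manual(c(0.9, 0.8, 0.3, 0.2), c(1, 1, 0, 0))  # perfectly separated: 1
```

Applied to a model's predicted probabilities and the true labels, this agrees with ROCR's `performance(pred, "auc")`.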

By tuning the decision threshold, we were able to improve training and testing set accuracy. It turned out the 3PM model achieved a satisfactory accuracy. The ROC plots gave us an understanding of how the true/false positive rates vary and what accuracy a specific decision threshold (i.e. cutoff value) obtains. Finally, the AUC values were reported.

If you have any questions, please feel free to comment below.


Last January, Tensorflow for R was released, which provided access to the Tensorflow API from R. However, for most R users, the interface was not very R-like.

Take a look at this code chunk for training a model:

cross_entropy <- tf$reduce_mean(-tf$reduce_sum(y_ * tf$log(y_conv), reduction_indices=1L))
train_step <- tf$train$AdamOptimizer(1e-4)$minimize(cross_entropy)
correct_prediction <- tf$equal(tf$argmax(y_conv, 1L), tf$argmax(y_, 1L))
accuracy <- tf$reduce_mean(tf$cast(correct_prediction, tf$float32))
sess$run(tf$global_variables_initializer())

for (i in 1:20000) {
  batch <- mnist$train$next_batch(50L)
  if (i %% 100 == 0) {
    train_accuracy <- accuracy$eval(feed_dict = dict(
      x = batch[[1]], y_ = batch[[2]], keep_prob = 1.0))
    cat(sprintf("step %d, training accuracy %g\n", i, train_accuracy))
  }
  train_step$run(feed_dict = dict(
    x = batch[[1]], y_ = batch[[2]], keep_prob = 0.5))
}

test_accuracy % fit(x = train_x, y = train_y, epochs=epochs, batch_size=batch_size, validation_data=valid)

So if you are still with me, let me show you how to build deep learning models using R, Keras, and Tensorflow together. You will find a Github repo that contains the code and data you will need. Included is an R notebook that walks through building an image classifier (telling cats from dogs), but it can easily be generalized to other images. The walkthrough includes advanced methods that are commonly used for production deep learning work, including:

- augmenting data
- using the bottleneck features of a pre-trained network
- fine-tuning the top layers of a pre-trained network
- saving weights for your models

The R interface to Keras truly makes it easy to build deep learning models in R. Here are some code snippets to illustrate how intuitive and useful Keras for R is:

To load picture from a folder:

train_generator <- flow_images_from_directory(
  train_directory,
  generator = image_data_generator(),
  target_size = c(img_width, img_height),
  color_mode = "rgb",
  class_mode = "binary",
  batch_size = batch_size,
  shuffle = TRUE,
  seed = 123)

To define a simple convolutional neural network:

model <- keras_model_sequential()

model %>%
  layer_conv_2d(filter = 32, kernel_size = c(3,3), input_shape = c(img_width, img_height, 3)) %>%
  layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_conv_2d(filter = 32, kernel_size = c(3,3)) %>%
  layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_conv_2d(filter = 64, kernel_size = c(3,3)) %>%
  layer_activation("relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_flatten() %>%
  layer_dense(64) %>%
  layer_activation("relu") %>%
  layer_dropout(0.5) %>%
  layer_dense(1) %>%
  layer_activation("sigmoid")

To augment data:

augment <- image_data_generator(
  rescale = 1/255,
  shear_range = 0.2,
  zoom_range = 0.2,
  horizontal_flip = TRUE)

To load a pretrained network:

model_vgg <- application_vgg16(include_top = FALSE, weights = "imagenet")

To save model weights:

save_model_weights_hdf5(model_ft, 'finetuning_30epochs_vggR.h5', overwrite = TRUE)

I believe the Keras for R interface will make it much easier for the R community to build and refine deep learning models in R. This means you don't have to force everyone to use Python to build, refine, and test your models. I really think this will open up deep learning to a wider audience that was a bit apprehensive about using Python. So for now, give it a spin!

Grab my github repo, fire up RStudio (or your IDE of choice), and go build a simple classifier using Keras.
