By Yuri Resende
Today we will use an online convex optimization technique to build a very simple algorithm for portfolio allocation. Of course this is just an illustrative post and we are going to make some simplifying assumptions. The objective is to point out an interesting direction to approach the problem of portfolio allocation.
The algorithm used here is Online Gradient Descent (OGD), and we are going to compare the performance of the portfolio with the Uniform Constant Rebalanced Portfolio and the Dow Jones Industrial Average index. You can skip directly to Implementation and Example if you already know what an online algorithm is.
From now on, we will say that $latex b_t$ represents a point in dimension $latex N$, where $latex N$ is the number of possible stocks to invest in. Each of its coordinates represents the proportion invested in the respective stock. For example, suppose that $latex N = 3$ and $latex b_t = (0.5, 0.3, 0.2)$ (any weights summing to one would do). Then our portfolio at time $latex t$ is composed of 50% of stock one, 30% of stock two, and 20% of stock three.
Online Convex Optimization is based on the idea of minimizing the so-called Regret function, which is basically a measure of how far you are from a good player (some kind of optimal player that you want to be). In the set-up of portfolio management, the Regret function is defined as:

$latex \mathrm{Regret}_T = \sum_{t=1}^{T} \log(b^{*\top} r_t) \; - \; \sum_{t=1}^{T} \log(b_t^\top r_t)$

where $latex r_t$ is the vector of gross returns at time $latex t$ and $latex b^* = \arg\max_{b} \sum_{t=1}^{T} \log(b^\top r_t)$.

The first part on the right-hand side is the cumulative log return of the portfolio $latex b^*$, and the second part is the cumulative log return of our portfolio up to time $latex T$, where we have played strategy $latex b_t$. What is interesting is that $latex b^*$ is a very special portfolio: it is the Best Constant Rebalanced Portfolio (CRP). It means that if someone could see the future, knew upfront all the returns of all the stocks, and had to commit to a fixed portfolio proportion at time zero, $latex b^*$ would be his choice. In the worst-case scenario the best CRP is equal to the best stock in the period, and the best CRP is the portfolio that we want to track.
The OGD algorithm is quite simple: we are going to update our position every week based on the gradient of the function evaluated at the current strategy. Here it follows:

$latex b_{t+1} = \Pi\left(b_t + \eta \, \nabla f_t(b_t)\right)$

where $latex f_t(b) = \log(b^\top r_t)$ and $latex \Pi$ denotes the Euclidean projection onto the set of viable portfolios.
This algorithm is proved to have Regret of order $latex O(\sqrt{T})$ if $latex \eta \sim 1/\sqrt{T}$. There is another, similar algorithm which uses second-order information and is quite similar to an online version of Newton's method. The name, not surprisingly, is Online Newton Step. See Agarwal et al. (2006) for further details.
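For completeness, the textbook guarantee for OGD (not specific to portfolios) can be stated as follows, assuming $latex D$ bounds the diameter of the viable set and $latex G$ bounds the norm of the gradients:

$latex \mathrm{Regret}_T \;\leq\; \frac{D^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|\nabla f_t(b_t)\|^2 \;\leq\; D G \sqrt{T}$

where the last inequality follows from choosing $latex \eta = \frac{D}{G\sqrt{T}}$.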
In our set-up the set of possible portfolios does not allow us to borrow money or short sell stocks, so every position should be greater than or equal to zero and the sum of all positions should be one. This set is called the simplex, which is a polytope: basically a triangle in higher dimensions.
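The R code below solves this projection as a quadratic program, but the Euclidean projection onto the simplex also has a well-known direct solution based on sorting. A minimal Python sketch (the function name is my own; this is the standard sorting-based algorithm, not the QP route used later):

```python
def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}.

    Sort the coordinates, find the largest index rho for which the
    shifted coordinate stays positive, then subtract the resulting
    threshold theta and clip at zero.
    """
    u = sorted(v, reverse=True)
    css = 0.0                      # running cumulative sum of sorted values
    rho, rho_css = 0, u[0]
    for i, x in enumerate(u):
        css += x
        if x - (css - 1.0) / (i + 1) > 0:
            rho, rho_css = i, css
    theta = (rho_css - 1.0) / (rho + 1)
    return [max(x - theta, 0.0) for x in v]
```

For example, a gradient step that leaves the simplex, such as `[2.0, 0.0]`, is mapped back to the closest viable portfolio `[1.0, 0.0]`.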
library(quantmod)

symbols <- c('MMM', 'AXP', 'AAPL', 'BA', 'CAT', 'CVX', 'CSCO', 'KO', 'DIS', 'DD',
             'XOM', 'GE', 'GS', 'HD', 'IBM', 'INTC', 'JNJ', 'JPM', 'MCD', 'MRK',
             'MSFT', 'NKE', 'PFE', 'PG', 'TRV', 'UTX', 'UNH', 'VZ', 'V', 'WMT')

for (i in 1:length(symbols)) getSymbols(symbols[i], from = '2007-01-03', to = '2017-06-21')

# Building weekly returns for each of the stocks
data <- sapply(symbols, function(x) Ad(to.weekly(get(x))))
data <- Reduce(cbind, data)
data_returns <- apply(data, 2, function(x) diff(log(x))) # log returns
colnames(data_returns) <- symbols
data_returns[is.na(data_returns)] <- 0 # Visa was not traded between 2007 and 2008
The OGD algorithm is implemented next. It uses the quadprogXT package to solve the projection onto the viable set. We are going to test two different values of $latex \eta$ to check the algorithm's sensitivity.
library(quadprogXT)

OGD <- function(base, eta) {
  # Gradient of the regret function
  gradient = function(b, p, r) b + r/(p %*% r)

  # Projection onto the viable set
  proj = function(p) {
    Dmat = diag(length(p))
    Amat = cbind(diag(rep(1, length(p))), -1)
    bvec = c(rep(0, length(p)), -1)
    fit = solveQPXT(Dmat = Dmat, dvec = p, Amat = Amat, bvec = bvec)
    return(fit$solution)
  }

  T = nrow(base)
  N = ncol(base)
  r = as.matrix(base) + 1 # the algorithm doesn't work directly with log returns

  p = matrix(0, nrow = N, ncol = T); p[,1] = 1/N # initial portfolio
  b = matrix(0, nrow = N, ncol = T); b[,1] = 0

  for (i in 2:T) {
    b[,i] = gradient(b[,i-1], p[,i-1], r[i-1,]) # calculating the gradient
    p.aux = p[,i-1] + eta*b[,i]                 # what we would like to play
    p[,i] = proj(p.aux)                         # projection onto the viable set
  }
  return(list('portfolio' = p, 'gradient' = b))
}

# testing two etas
portfolio1 <- OGD(base = data_returns, eta = 1/100)
portfolio2 <- OGD(base = data_returns, eta = 1/1000)
Let's build a small function to calculate the compound return of a portfolio, taking its log returns as input.
compound_return <- function(portfolio, r) {
  return_OGD <- c(); return_OGD[1] = portfolio$portfolio[,1] %*% r[1,]
  portfolio_value <- c(); portfolio_value[1] = 1 + portfolio$portfolio[,1] %*% r[1,]
  for (i in 2:nrow(r)) {
    return_OGD[i] = portfolio$portfolio[,i] %*% r[i,]
    portfolio_value[i] = portfolio_value[i-1]*(1 + return_OGD[i])
  }
  return(list('value' = portfolio_value, 'return' = return_OGD))
}
Now let's check the performance of the OGD algorithm with both values of $latex \eta$, $latex 1/100$ and $latex 1/1000$. We are going to compare the cumulative return first with the Uniform Constant Rebalanced Portfolio and the DJIA index. Then we will check the compound returns of the DJIA stocks individually, to see if the algorithm tracked the best stocks as we want it to.
# Dow Jones
getSymbols('^DJI', src = 'yahoo', from = '2007-01-03', to = '2017-06-21')
## [1] "DJI"
DJIA_returns = as.numeric(cumprod(weeklyReturn(DJI) + 1))

# Individual stocks
stocks_returns = apply(data_returns + 1, 2, cumprod)

# Our portfolios
portfolio_value1 <- compound_return(portfolio1, data_returns)
portfolio_value2 <- compound_return(portfolio2, data_returns)
In the figures below we can see the performance of the strategies that were implemented. Both portfolios performed surprisingly well, much better than the DJIA and the UCRP portfolio.
Indeed, if $latex \eta$ is relatively big, the algorithm quickly found Apple to be the best stock and started pursuing it. If $latex \eta$ is relatively small, then the algorithm takes more time to adapt, but this makes the portfolio more stable.
To finish the comparison between our portfolios and the DJIA, let's do a very simple risk analysis using just empirical quantiles of the historical results. For a proper risk analysis we should model the conditional volatility of the portfolio, but this will be discussed in another post.
risk_analysis <- function(return) {
  report = matrix(0, ncol = 2, nrow = 2)
  report[1,] = quantile(return, probs = c(0.01, 0.05))
  report[2,] = c(mean(return[which(return <= report[1,1])]),
                 mean(return[which(return <= report[1,2])]))
  rownames(report) = c('VaR', 'CVaR')
  colnames(report) = c('1%', '5%')
  return(round(report, 3))
}

report1 <- risk_analysis(portfolio_value1$return)
report2 <- risk_analysis(portfolio_value2$return)
report_DJIA <- risk_analysis(weeklyReturn(DJI))
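The same empirical VaR/CVaR rule can be sketched in Python (the function name is illustrative; R's quantile() interpolates between order statistics, while this sketch uses the plain order-statistic rule, so the numbers can differ slightly):

```python
def var_cvar(returns, alpha):
    """Empirical VaR and CVaR at level alpha from historical returns.

    VaR is the alpha-quantile of the return distribution (a loss threshold);
    CVaR is the mean of the returns at or below that threshold.
    """
    s = sorted(returns)              # worst returns first
    k = max(int(alpha * len(s)), 1)  # number of tail observations
    var = s[k - 1]
    cvar = sum(s[:k]) / k
    return var, cvar
```

For instance, with ten weekly returns and `alpha = 0.2`, the two worst observations form the tail: VaR is the second-worst return and CVaR is the average of the two worst.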
For portfolio1:

| | 1% | 5% |
|---|---|---|
| VaR | -0.091 | -0.047 |
| CVaR | -0.119 | -0.074 |

For portfolio2:

| | 1% | 5% |
|---|---|---|
| VaR | -0.057 | -0.039 |
| CVaR | -0.085 | -0.055 |

For the DJIA:

| | 1% | 5% |
|---|---|---|
| VaR | -0.060 | -0.040 |
| CVaR | -0.084 | -0.055 |
It's quite evident that setting a big $latex \eta$, as we did in portfolio1, may have caused problems with VaR and conditional VaR, since Apple is a volatile stock. For portfolio2, where $latex \eta$ is small and the portfolio takes time to change, the historical VaR and CVaR are very similar to what we already observed for the DJIA, but the returns were much better.
Agarwal, Amit, et al. "Algorithms for portfolio management based on the Newton method." Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006.
A new version of the nanotime package for working with nanosecond timestamps just arrived on CRAN.

nanotime uses the RcppCCTZ package for (efficient) high(er) resolution time parsing and formatting up to nanosecond resolution, and the bit64 package for the actual integer64 arithmetic.
Thanks to a metric ton of work by Leonardo Silvestri, the package now uses S4 classes internally allowing for greater consistency of operations on nanotime objects.
Changes in version 0.2.0 (2017-06-22)
- Rewritten in S4 to provide more robust operations (#17 by Leonardo)
- Ensure tz="" is treated as unset (Leonardo in #20)
- Added format and tz arguments to nanotime and format
- Ensure printing respects options()$max.print, ensure names are kept with vector (#23 by Leonardo)
- Correct summary() by defining names<- (Leonardo in #25 fixing #24)
- Report error on operations that are not meaningful for the type; handle NA, NaN, Inf, -Inf correctly (Leonardo in #27 fixing #26)
We also have a diff to the previous version thanks to CRANberries. More details and examples are at the nanotime page; code, issue tickets etc at the GitHub repository.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
Among the many R packages, there is the outbreaks package. It contains datasets on epidemics, one of which is from the 2013 outbreak of influenza A H7N9 in China, as analysed by Kucharski et al. (2014). I will be using their data as an example to test whether we can use Machine Learning algorithms to predict disease outcome.
To do so, I selected and extracted features from the raw data, including age, days between onset and outcome, gender, whether the patients were hospitalised, etc. Missing values were imputed, and different model algorithms were used to predict the outcome (death or recovery), comparing their prediction accuracy, sensitivity and specificity. The thus prepared dataset was divided into training and testing subsets; the test subset contained all cases with an unknown outcome. Before I applied the models to the test data, I further split the training data into validation subsets.
The tested modeling algorithms were similarly successful at predicting the outcomes of the validation data. To decide on final classifications, I compared predictions from all models and defined the outcome “Death” or “Recovery” as a function of all models, whereas classifications with a low prediction probability were flagged as “uncertain”. Accounting for this uncertainty led to a 100% correct classification of the validation test set.
The test cases with unknown outcome were then classified based on the same algorithms. Of the 57 unknown cases, 14 were classified as "Recovery", 10 as "Death" and 33 as uncertain.
The dataset contains case ID, date of onset, date of hospitalisation, date of outcome, gender, age, province and of course the outcome: Death or Recovery. I can already see that there are a couple of missing values in the data, which I will deal with later.
# install and load package
if (!require("outbreaks")) install.packages("outbreaks")
library(outbreaks)
fluH7N9.china.2013_backup <- fluH7N9.china.2013 # back up original dataset in case something goes awry along the way

# convert ? to NAs
fluH7N9.china.2013$age[which(fluH7N9.china.2013$age == "?")] <- NA

# create a new column with case ID
fluH7N9.china.2013$case.ID <- paste("case", fluH7N9.china.2013$case.ID, sep = "_")
head(fluH7N9.china.2013)
##   case.ID date.of.onset date.of.hospitalisation date.of.outcome outcome gender age province
## 1  case_1    2013-02-19                    <NA>      2013-03-04   Death      m  87 Shanghai
## 2  case_2    2013-02-27              2013-03-03      2013-03-10   Death      m  27 Shanghai
## 3  case_3    2013-03-09              2013-03-19      2013-04-09   Death      f  35    Anhui
## 4  case_4    2013-03-19              2013-03-27            <NA>    <NA>      f  45  Jiangsu
## 5  case_5    2013-03-19              2013-03-30      2013-05-15 Recover      f  48  Jiangsu
## 6  case_6    2013-03-21              2013-03-28      2013-04-26   Death      f  32  Jiangsu
Before I start preparing the data for Machine Learning, I want to get an idea of the distribution of the data points and their different variables by plotting. Most provinces have only a handful of cases, so I am combining them into the category "Other" and keeping only Jiangsu, Shanghai and Zhejiang as separate provinces.
# gather for plotting with ggplot2
library(tidyr)
fluH7N9.china.2013_gather <- fluH7N9.china.2013 %>%
  gather(Group, Date, date.of.onset:date.of.outcome)

# rearrange group order
fluH7N9.china.2013_gather$Group <- factor(fluH7N9.china.2013_gather$Group,
                                          levels = c("date.of.onset", "date.of.hospitalisation", "date.of.outcome"))

# rename groups
library(plyr)
fluH7N9.china.2013_gather$Group <- mapvalues(fluH7N9.china.2013_gather$Group,
                                             from = c("date.of.onset", "date.of.hospitalisation", "date.of.outcome"),
                                             to = c("Date of onset", "Date of hospitalisation", "Date of outcome"))

# renaming provinces
fluH7N9.china.2013_gather$province <- mapvalues(fluH7N9.china.2013_gather$province,
                                                from = c("Anhui", "Beijing", "Fujian", "Guangdong", "Hebei", "Henan",
                                                         "Hunan", "Jiangxi", "Shandong", "Taiwan"),
                                                to = rep("Other", 10))

# add a level for unknown gender
levels(fluH7N9.china.2013_gather$gender) <- c(levels(fluH7N9.china.2013_gather$gender), "unknown")
fluH7N9.china.2013_gather$gender[is.na(fluH7N9.china.2013_gather$gender)] <- "unknown"

# rearrange province order so that Other is last
fluH7N9.china.2013_gather$province <- factor(fluH7N9.china.2013_gather$province,
                                             levels = c("Jiangsu", "Shanghai", "Zhejiang", "Other"))

# convert age to numeric
fluH7N9.china.2013_gather$age <- as.numeric(as.character(fluH7N9.china.2013_gather$age))

library(ggplot2)
my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 12),
      axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 14),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "aliceblue"),
      strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1),
      strip.text = element_text(face = "bold", size = 12, color = "black"),
      legend.position = "bottom",
      legend.justification = "top",
      legend.box = "horizontal",
      legend.box.background = element_rect(colour = "grey50"),
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
    )
}

# plotting raw data
ggplot(data = fluH7N9.china.2013_gather, aes(x = Date, y = age, fill = outcome)) +
  stat_density2d(aes(alpha = ..level..), geom = "polygon") +
  geom_jitter(aes(color = outcome, shape = gender), size = 1.5) +
  geom_rug(aes(color = outcome)) +
  labs(
    fill = "Outcome", color = "Outcome", alpha = "Level", shape = "Gender",
    x = "Date in 2013", y = "Age",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Dataset from 'outbreaks' package (Kucharski et al. 2014)",
    caption = ""
  ) +
  facet_grid(Group ~ province) +
  my_theme() +
  scale_shape_manual(values = c(15, 16, 17)) +
  scale_color_brewer(palette = "Set1", na.value = "grey50") +
  scale_fill_brewer(palette = "Set1")
This plot shows the dates of onset, hospitalisation and outcome (if known) of each data point. Outcome is marked by color and age shown on the y-axis. Gender is marked by point shape. The density distribution of date by age for the cases seems to indicate that older people died more frequently in the Jiangsu and Zhejiang provinces than in Shanghai and in other provinces. When we look at the distribution of points along the time axis, it suggests that there might be a positive correlation between the likelihood of death and an early onset or early outcome.
I also want to know how many cases there are for each gender and province and compare the genders’ age distribution.
fluH7N9.china.2013_gather_2 <- fluH7N9.china.2013_gather[, -4] %>%
  gather(group_2, value, gender:province)

fluH7N9.china.2013_gather_2$value <- mapvalues(fluH7N9.china.2013_gather_2$value,
                                               from = c("m", "f", "unknown", "Other"),
                                               to = c("Male", "Female", "Unknown gender", "Other province"))

fluH7N9.china.2013_gather_2$value <- factor(fluH7N9.china.2013_gather_2$value,
                                            levels = c("Female", "Male", "Unknown gender",
                                                       "Jiangsu", "Shanghai", "Zhejiang", "Other province"))

p1 <- ggplot(data = fluH7N9.china.2013_gather_2, aes(x = value, fill = outcome, color = outcome)) +
  geom_bar(position = "dodge", alpha = 0.7, size = 1) +
  my_theme() +
  scale_fill_brewer(palette = "Set1", na.value = "grey50") +
  scale_color_brewer(palette = "Set1", na.value = "grey50") +
  labs(
    color = "", fill = "", x = "", y = "Count",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Gender and Province numbers of flu cases",
    caption = ""
  )

p2 <- ggplot(data = fluH7N9.china.2013_gather, aes(x = age, fill = outcome, color = outcome)) +
  geom_density(alpha = 0.3, size = 1) +
  geom_rug() +
  scale_color_brewer(palette = "Set1", na.value = "grey50") +
  scale_fill_brewer(palette = "Set1", na.value = "grey50") +
  my_theme() +
  labs(
    color = "", fill = "", x = "Age", y = "Density",
    title = "",
    subtitle = "Age distribution of flu cases",
    caption = ""
  )

library(gridExtra)
library(grid)
grid.arrange(p1, p2, ncol = 2)
In the dataset, there are more male than female cases and correspondingly, we see more deaths, recoveries and unknown outcomes in men than in women. This is potentially a problem later on for modeling because the inherent likelihoods for outcome are not directly comparable between the sexes. Most unknown outcomes were recorded in Zhejiang. Similarly to gender, we don’t have an equal distribution of data points across provinces either. When we look at the age distribution it is obvious that people who died tended to be slightly older than those who recovered. The density curve of unknown outcomes is more similar to that of death than of recovery, suggesting that among these people there might have been more deaths than recoveries. And lastly, I want to plot how many days passed between onset, hospitalisation and outcome for each case.
ggplot(data = fluH7N9.china.2013_gather, aes(x = Date, y = age, color = outcome)) +
  geom_point(aes(shape = gender), size = 1.5, alpha = 0.6) +
  geom_path(aes(group = case.ID)) +
  facet_wrap(~ province, ncol = 2) +
  my_theme() +
  scale_shape_manual(values = c(15, 16, 17)) +
  scale_color_brewer(palette = "Set1", na.value = "grey50") +
  scale_fill_brewer(palette = "Set1") +
  labs(
    color = "Outcome", shape = "Gender",
    x = "Date in 2013", y = "Age",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Dataset from 'outbreaks' package (Kucharski et al. 2014)",
    caption = "\nTime from onset of flu to outcome."
  )
The resulting plot shows that there are many missing values in the dates, so it is hard to draw a general conclusion.
In Machine Learning-speak features are the variables used for model training. Using the right features dramatically influences the accuracy of the model.
Because we don’t have many features, I am keeping age as it is, but I am also generating new features:
# preparing the data frame for modeling
library(dplyr)

dataset <- fluH7N9.china.2013 %>%
  mutate(hospital = as.factor(ifelse(is.na(date.of.hospitalisation), 0, 1)),
         gender_f = as.factor(ifelse(gender == "f", 1, 0)),
         province_Jiangsu = as.factor(ifelse(province == "Jiangsu", 1, 0)),
         province_Shanghai = as.factor(ifelse(province == "Shanghai", 1, 0)),
         province_Zhejiang = as.factor(ifelse(province == "Zhejiang", 1, 0)),
         province_other = as.factor(ifelse(province == "Zhejiang" | province == "Jiangsu" | province == "Shanghai", 0, 1)),
         days_onset_to_outcome = as.numeric(as.character(gsub(" days", "",
                                 as.Date(as.character(date.of.outcome), format = "%Y-%m-%d") -
                                 as.Date(as.character(date.of.onset), format = "%Y-%m-%d")))),
         days_onset_to_hospital = as.numeric(as.character(gsub(" days", "",
                                  as.Date(as.character(date.of.hospitalisation), format = "%Y-%m-%d") -
                                  as.Date(as.character(date.of.onset), format = "%Y-%m-%d")))),
         age = as.numeric(as.character(age)),
         early_onset = as.factor(ifelse(date.of.onset < summary(fluH7N9.china.2013$date.of.onset)[[3]], 1, 0)),
         early_outcome = as.factor(ifelse(date.of.outcome < summary(fluH7N9.china.2013$date.of.outcome)[[3]], 1, 0))) %>%
  subset(select = -c(2:4, 6, 8))

rownames(dataset) <- dataset$case.ID
dataset <- dataset[, -1]
When looking at the dataset I created for modeling, it is obvious that we have quite a few missing values. The missing values from the outcome column are what I want to predict but for the rest I would either have to remove the entire row from the data or impute the missing information. I decided to try the latter with the mice package.
# impute missing data
library(mice)

dataset_impute <- mice(dataset[, -1], print = FALSE)

# recombine imputed data frame with the outcome column
dataset_complete <- merge(dataset[, 1, drop = FALSE], mice::complete(dataset_impute, 1),
                          by = "row.names", all = TRUE)
rownames(dataset_complete) <- dataset_complete$Row.names
dataset_complete <- dataset_complete[, -1]
For building the model, I am separating the imputed data frame into training and test data. Test data are the 57 cases with unknown outcome.
summary(dataset$outcome)
##   Death Recover    NA's
##      32      47      57
The training data will be further divided for validation of the models: 70% of the training data will be kept for model building and the remaining 30% will be used for model testing. I am using the caret package for modeling.
train_index <- which(is.na(dataset_complete$outcome))
train_data <- dataset_complete[-train_index, ]
test_data <- dataset_complete[train_index, -1]

library(caret)
set.seed(27)
val_index <- createDataPartition(train_data$outcome, p = 0.7, list = FALSE)
val_train_data <- train_data[val_index, ]
val_test_data <- train_data[-val_index, ]
val_train_X <- val_train_data[, -1]
val_test_X <- val_test_data[, -1]
To get an idea about how each feature contributes to the prediction of the outcome, I first built a decision tree based on the training data using rpart and rattle.
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)

set.seed(27)
fit <- rpart(outcome ~ .,
             data = train_data,
             method = "class",
             control = rpart.control(xval = 10, minbucket = 2, cp = 0),
             parms = list(split = "information"))
fancyRpartPlot(fit)
This randomly generated decision tree shows that cases with an early outcome were most likely to die when they were 68 or older, when they also had an early onset and when they were sick for fewer than 13 days. If a person was not among the first cases and was younger than 52, they had a good chance of recovering, but if they were 82 or older, they were more likely to die from the flu.
Not all of the features I created will be equally important to the model. The decision tree already gave me an idea of which features might be most important but I also want to estimate feature importance using a Random Forest approach with repeated cross validation.
# prepare training scheme
control <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

# train the model
set.seed(27)
model <- train(outcome ~ ., data = train_data, method = "rf", preProcess = NULL, trControl = control)

# estimate variable importance
importance <- varImp(model, scale = TRUE)

# prepare for plotting
importance_df_1 <- importance$importance
importance_df_1$group <- rownames(importance_df_1)
importance_df_1$group <- mapvalues(importance_df_1$group,
                                   from = c("age", "hospital2", "gender_f2", "province_Jiangsu2",
                                            "province_Shanghai2", "province_Zhejiang2", "province_other2",
                                            "days_onset_to_outcome", "days_onset_to_hospital",
                                            "early_onset2", "early_outcome2"),
                                   to = c("Age", "Hospital", "Female", "Jiangsu", "Shanghai", "Zhejiang",
                                          "Other province", "Days onset to outcome", "Days onset to hospital",
                                          "Early onset", "Early outcome"))

f = importance_df_1[order(importance_df_1$Overall, decreasing = FALSE), "group"]

importance_df_2 <- importance_df_1
importance_df_2$Overall <- 0
importance_df <- rbind(importance_df_1, importance_df_2)

# setting factor levels
importance_df <- within(importance_df, group <- factor(group, levels = f))
importance_df_1 <- within(importance_df_1, group <- factor(group, levels = f))

ggplot() +
  geom_point(data = importance_df_1, aes(x = Overall, y = group, color = group), size = 2) +
  geom_path(data = importance_df, aes(x = Overall, y = group, color = group, group = group), size = 1) +
  scale_color_manual(values = rep(brewer.pal(1, "Set1")[1], 11)) +
  my_theme() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5)) +
  labs(
    x = "Importance", y = "",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Scaled feature importance",
    caption = "\nDetermined with Random Forest and repeated cross validation (10 repeats, 10 times)"
  )
Before I start actually building models, I want to check whether the distribution of feature values is comparable between training, validation and test datasets.
# prepare for plotting
dataset_complete_gather <- dataset_complete %>%
  mutate(set = ifelse(rownames(dataset_complete) %in% rownames(test_data), "Test Data",
               ifelse(rownames(dataset_complete) %in% rownames(val_train_data), "Validation Train Data",
               ifelse(rownames(dataset_complete) %in% rownames(val_test_data), "Validation Test Data", "NA"))),
         case_ID = rownames(.)) %>%
  gather(group, value, age:early_outcome)

dataset_complete_gather$group <- mapvalues(dataset_complete_gather$group,
                                           from = c("age", "hospital", "gender_f", "province_Jiangsu",
                                                    "province_Shanghai", "province_Zhejiang", "province_other",
                                                    "days_onset_to_outcome", "days_onset_to_hospital",
                                                    "early_onset", "early_outcome"),
                                           to = c("Age", "Hospital", "Female", "Jiangsu", "Shanghai", "Zhejiang",
                                                  "Other province", "Days onset to outcome", "Days onset to hospital",
                                                  "Early onset", "Early outcome"))

ggplot(data = dataset_complete_gather, aes(x = as.numeric(value), fill = outcome, color = outcome)) +
  geom_density(alpha = 0.2) +
  geom_rug() +
  scale_color_brewer(palette = "Set1", na.value = "grey50") +
  scale_fill_brewer(palette = "Set1", na.value = "grey50") +
  my_theme() +
  facet_wrap(set ~ group, ncol = 11, scales = "free") +
  labs(
    x = "Value", y = "Density",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Features for classifying outcome",
    caption = "\nDensity distribution of all features used for classification of flu outcome."
  )
ggplot(subset(dataset_complete_gather, group == "Age" | group == "Days onset to hospital" | group == "Days onset to outcome"),
       aes(x = outcome, y = as.numeric(value), fill = set)) +
  geom_boxplot() +
  my_theme() +
  scale_fill_brewer(palette = "Set1", type = "div") +
  facet_wrap(~ group, ncol = 3, scales = "free") +
  labs(
    fill = "", x = "Outcome", y = "Value",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Features for classifying outcome",
    caption = "\nBoxplot of the features age, days from onset to hospitalisation and days from onset to outcome."
  )
Luckily, the distributions look reasonably similar between the validation and test data, except for a few outliers.
Before I try to predict the outcome of the unknown cases, I am testing the models' accuracy with the validation datasets on a couple of algorithms. I have chosen only a few of the more well-known algorithms, but caret implements many more. I chose not to do any preprocessing because I was worried that the different data distributions of continuous variables (e.g. age) and binary variables (i.e. 0/1 classification of e.g. hospitalisation) would lead to problems.
Random Forests predictions are based on the generation of multiple classification trees. This model classified 14 out of 23 cases correctly.
set.seed(27)
model_rf <- caret::train(outcome ~ .,
                         data = val_train_data,
                         method = "rf",
                         preProcess = NULL,
                         trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_rf, val_test_data[, -1]), val_test_data$outcome)
Lasso and elastic-net regularized generalized linear models are based on linear regression and are useful when we have feature correlation in our model. This model classified 13 out of 23 cases correctly.
set.seed(27)
model_glmnet <- caret::train(outcome ~ .,
                             data = val_train_data,
                             method = "glmnet",
                             preProcess = NULL,
                             trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_glmnet, val_test_data[, -1]), val_test_data$outcome)
set.seed(27)
model_kknn <- caret::train(outcome ~ .,
                           data = val_train_data,
                           method = "kknn",
                           preProcess = NULL,
                           trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_kknn, val_test_data[, -1]), val_test_data$outcome)
set.seed(27)
model_pda <- caret::train(outcome ~ .,
                          data = val_train_data,
                          method = "pda",
                          preProcess = NULL,
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_pda, val_test_data[, -1]), val_test_data$outcome)
Stabilized Linear Discriminant Analysis is designed for high-dimensional data and correlated co-variables. This model classified 15 out of 23 cases correctly.
set.seed(27)
model_slda <- caret::train(outcome ~ .,
                           data = val_train_data,
                           method = "slda",
                           preProcess = NULL,
                           trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_slda, val_test_data[, -1]), val_test_data$outcome)
Nearest Shrunken Centroids computes a standardized centroid for each class and shrinks each centroid toward the overall centroid for all classes. This model classified 15 out of 23 cases correctly.
set.seed(27)
model_pam <- caret::train(outcome ~ .,
                          data = val_train_data,
                          method = "pam",
                          preProcess = NULL,
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_pam, val_test_data[, -1]), val_test_data$outcome)
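The shrinkage step at the heart of Nearest Shrunken Centroids can be sketched as soft-thresholding each class centroid's deviation from the overall centroid. This is a simplified illustration with made-up names; the actual method also standardises deviations by within-class variances:

```python
def shrink_centroid(class_centroid, overall_centroid, delta):
    """Shrink a class centroid toward the overall centroid by soft-thresholding.

    Each coordinate's deviation is reduced in magnitude by delta; deviations
    smaller than delta become zero, so uninformative features drop out of
    the classifier entirely.
    """
    shrunk = []
    for c, o in zip(class_centroid, overall_centroid):
        d = c - o
        magnitude = max(abs(d) - delta, 0.0)  # soft threshold
        shrunk.append(o + (magnitude if d >= 0 else -magnitude))
    return shrunk
```

For example, with `delta = 0.5`, a deviation of 1.0 shrinks to 0.5 while a deviation of 0.2 is zeroed out, which is what makes the method perform implicit feature selection.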
C5.0 is another tree-based modeling algorithm. This model classified 15 out of 23 cases correctly.
set.seed(27)
model_C5.0Tree <- caret::train(outcome ~ .,
                               data = val_train_data,
                               method = "C5.0Tree",
                               preProcess = NULL,
                               trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_C5.0Tree, val_test_data[, -1]), val_test_data$outcome)
Partial least squares regression combines principal component analysis and multiple regression and is useful for modeling with correlated features. This model classified 15 out of 23 cases correctly.
set.seed(27)
model_pls <- caret::train(outcome ~ .,
                          data = val_train_data,
                          method = "pls",
                          preProcess = NULL,
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
confusionMatrix(predict(model_pls, val_test_data[, -1]), val_test_data$outcome)
All models were similarly accurate.
# Create a list of models
models <- list(rf = model_rf, glmnet = model_glmnet, kknn = model_kknn,
               pda = model_pda, slda = model_slda, pam = model_pam,
               C5.0Tree = model_C5.0Tree, pls = model_pls)

# Resample the models
resample_results <- resamples(models)

# Generate a summary
summary(resample_results, metric = c("Kappa", "Accuracy"))
bwplot(resample_results, metric = c("Kappa", "Accuracy"))
To compare the predictions from all models, I summed up the prediction probabilities for Death and Recovery from all models and calculated the log2 of the ratio between the summed probabilities for Recovery and the summed probabilities for Death. All cases with a log2 ratio bigger than 1.5 were defined as Recover, cases with a log2 ratio below -1.5 were defined as Death, and the remaining cases were defined as uncertain.
results <- data.frame(randomForest = predict(model_rf, newdata = val_test_data[, -1], type = "prob"),
                      glmnet = predict(model_glmnet, newdata = val_test_data[, -1], type = "prob"),
                      kknn = predict(model_kknn, newdata = val_test_data[, -1], type = "prob"),
                      pda = predict(model_pda, newdata = val_test_data[, -1], type = "prob"),
                      slda = predict(model_slda, newdata = val_test_data[, -1], type = "prob"),
                      pam = predict(model_pam, newdata = val_test_data[, -1], type = "prob"),
                      C5.0Tree = predict(model_C5.0Tree, newdata = val_test_data[, -1], type = "prob"),
                      pls = predict(model_pls, newdata = val_test_data[, -1], type = "prob"))

results$sum_Death <- rowSums(results[, grep("Death", colnames(results))])
results$sum_Recover <- rowSums(results[, grep("Recover", colnames(results))])
results$log2_ratio <- log2(results$sum_Recover / results$sum_Death)
results$true_outcome <- val_test_data$outcome
results$pred_outcome <- ifelse(results$log2_ratio > 1.5, "Recover",
                               ifelse(results$log2_ratio < -1.5, "Death", "uncertain"))
results$prediction <- ifelse(results$pred_outcome == results$true_outcome, "CORRECT",
                             ifelse(results$pred_outcome == "uncertain", "uncertain", "wrong"))
All predictions based on all models were either correct or uncertain.
The above models will now be used to predict the outcome of cases with unknown fate.
set.seed(27)
model_rf <- caret::train(outcome ~ ., data = train_data, method = "rf", preProcess = NULL,
                         trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_glmnet <- caret::train(outcome ~ ., data = train_data, method = "glmnet", preProcess = NULL,
                             trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_kknn <- caret::train(outcome ~ ., data = train_data, method = "kknn", preProcess = NULL,
                           trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_pda <- caret::train(outcome ~ ., data = train_data, method = "pda", preProcess = NULL,
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_slda <- caret::train(outcome ~ ., data = train_data, method = "slda", preProcess = NULL,
                           trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_pam <- caret::train(outcome ~ ., data = train_data, method = "pam", preProcess = NULL,
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_C5.0Tree <- caret::train(outcome ~ ., data = train_data, method = "C5.0Tree", preProcess = NULL,
                               trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_pls <- caret::train(outcome ~ ., data = train_data, method = "pls", preProcess = NULL,
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))

models <- list(rf = model_rf, glmnet = model_glmnet, kknn = model_kknn,
               pda = model_pda, slda = model_slda, pam = model_pam,
               C5.0Tree = model_C5.0Tree, pls = model_pls)

# Resample the models
resample_results <- resamples(models)
bwplot(resample_results, metric = c("Kappa", "Accuracy"))
Here again, the accuracy is similar in all models. The final results are calculated as described above.
results <- data.frame(randomForest = predict(model_rf, newdata = test_data, type = "prob"),
                      glmnet = predict(model_glmnet, newdata = test_data, type = "prob"),
                      kknn = predict(model_kknn, newdata = test_data, type = "prob"),
                      pda = predict(model_pda, newdata = test_data, type = "prob"),
                      slda = predict(model_slda, newdata = test_data, type = "prob"),
                      pam = predict(model_pam, newdata = test_data, type = "prob"),
                      C5.0Tree = predict(model_C5.0Tree, newdata = test_data, type = "prob"),
                      pls = predict(model_pls, newdata = test_data, type = "prob"))

results$sum_Death <- rowSums(results[, grep("Death", colnames(results))])
results$sum_Recover <- rowSums(results[, grep("Recover", colnames(results))])
results$log2_ratio <- log2(results$sum_Recover / results$sum_Death)
results$predicted_outcome <- ifelse(results$log2_ratio > 1.5, "Recover",
                                    ifelse(results$log2_ratio < -1.5, "Death", "uncertain"))
From 57 cases, 14 were defined as Recover, 10 as Death and 33 as uncertain.
results_combined <- merge(results[, -c(1:16)],
                          fluH7N9.china.2013[which(fluH7N9.china.2013$case.ID %in% rownames(results)), ],
                          by.x = "row.names", by.y = "case.ID")
results_combined <- results_combined[, -c(2, 3, 8, 9)]

results_combined_gather <- results_combined %>%
  gather(group_dates, date, date.of.onset:date.of.hospitalisation)

results_combined_gather$group_dates <- factor(results_combined_gather$group_dates,
                                              levels = c("date.of.onset", "date.of.hospitalisation"))
results_combined_gather$group_dates <- mapvalues(results_combined_gather$group_dates,
                                                 from = c("date.of.onset", "date.of.hospitalisation"),
                                                 to = c("Date of onset", "Date of hospitalisation"))
results_combined_gather$gender <- mapvalues(results_combined_gather$gender,
                                            from = c("f", "m"), to = c("Female", "Male"))
levels(results_combined_gather$gender) <- c(levels(results_combined_gather$gender), "unknown")
results_combined_gather$gender[is.na(results_combined_gather$gender)] <- "unknown"
results_combined_gather$age <- as.numeric(as.character(results_combined_gather$age))

ggplot(data = results_combined_gather, aes(x = date, y = log2_ratio, color = predicted_outcome)) +
  geom_jitter(aes(size = age), alpha = 0.3) +
  geom_rug() +
  facet_grid(gender ~ group_dates) +
  labs(color = "Predicted outcome", size = "Age",
       x = "Date in 2013", y = "log2 ratio of prediction Recover vs Death",
       title = "2013 Influenza A H7N9 cases in China",
       subtitle = "Predicted outcome", caption = "") +
  my_theme() +
  scale_color_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1")
The comparison of date of onset, date of hospitalisation, gender and age with predicted outcome shows that predicted deaths were associated with older age than predicted recoveries. Date of onset does not show an obvious bias in either direction.
This dataset posed a couple of difficulties to begin with, like unequal distribution of data points across variables and missing data. This makes the modeling inherently prone to flaws. However, real life data isn’t perfect either, so I went ahead and tested the modeling success anyway. By accounting for uncertain classifications with low prediction probability, the validation data could be classified accurately. However, these few cases don’t give enough information to reliably predict the outcome with a more accurate model. More cases, more information (i.e. more features) and less missing data would improve the modeling outcome. Also, this example is only applicable to this specific case of flu. In order to draw more general conclusions about flu outcome, other cases and additional information, for example on medical parameters like preexisting medical conditions, disease parameters, demographic information, etc. would be necessary. All in all, this dataset served as a nice example of the possibilities (and pitfalls) of machine learning applications and showcases a basic workflow for building prediction models with R.
If you have questions feel free to post a comment.
Related Post
Excited to announce a new package called charlatan
. While perusing
packages from other programming languages, I saw a neat Python library
called faker
.
charlatan
is inspired from and ports many things from Python's
https://github.com/joke2k/faker library. In turn, faker
was inspired from
PHP's faker,
Perl's Faker, and
Ruby's faker. It appears that the PHP
library was the original - nice work PHP.
What could you do with this package? There are many use cases.
A major feature of charlatan is language support. Of course for some data types (numbers), languages don't come into play, but for many they do. That means you can create fake datasets specific to a language, or a dataset with a mix of languages, etc. For the variables in this package, we have not yet ported over all languages that Python's faker has. We have not ported every variable, or every language yet in those variables. We have added some variables to charlatan that are not in faker (e.g., taxonomy, gene sequences). Check out the issues to follow progress.
The main functions:
- ch_generate: generate a data.frame with fake data
- fraudster: single interface to all fake data methods
- ch_* functions that wrap low level interfaces, and are meant to be easier to use and provide an easy way to make many instances of a thing
Check out the package vignette to get started.
Install charlatan
install.packages("charlatan")
Or get the development version:
devtools::install_github("ropensci/charlatan")
library(charlatan)
fraudster
is an interface for all fake data variables (and locales):
x <- fraudster()
x$job()
#> [1] "Textile designer"
x$name()
#> [1] "Cris Johnston-Tremblay"
x$job()
#> [1] "Database administrator"
x$color_name()
#> [1] "SaddleBrown"
If you want to set locale, do so like fraudster(locale = "{locale}")
The locales that are supported vary by data variable. We're adding more locales through time, so do check in from time to time - or even better, send a pull request adding support for the locale you want for the variable(s) you want.
As an example, you can set locale for job data to any number of supported locales.
For jobs:
ch_job(locale = "en_US", n = 3)
#> [1] "Charity officer" "Financial adviser" "Buyer, industrial"
ch_job(locale = "fr_FR", n = 3)
#> [1] "Illustrateur" "Guichetier"
#> [3] "Responsable d'ordonnancement"
ch_job(locale = "hr_HR", n = 3)
#> [1] "Pomoćnik strojovođe"
#> [2] "Pećar"
#> [3] "Konzervator – restaurator savjetnik"
ch_job(locale = "uk_UA", n = 3)
#> [1] "Фрілансер" "Астрофізик" "Доцент"
ch_job(locale = "zh_TW", n = 3)
#> [1] "包裝設計" "空調冷凍技術人員" "鍋爐操作技術人員"
For colors:
ch_color_name(locale = "en_US", n = 3)
#> [1] "DarkMagenta" "Navy" "LightGray"
ch_color_name(locale = "uk_UA", n = 3)
#> [1] "Синій ВПС" "Темно-зелений хакі" "Берлінська лазур"
charlatan will tell you when a locale is not supported
ch_job(locale = "cv_MN")
#> Error: cv_MN not in set of available locales
ch_generate() helps you create data.frames with whatever variables you want that charlatan supports. Then you're ready to use the data.frame immediately in whatever your application is.
By default, you get back a certain set of variables. Right now, that is: name, job, and phone_number.
ch_generate()
#> # A tibble: 10 x 3
#> name job
#> <chr> <chr>
#> 1 Coy Davis Geneticist, molecular
#> 2 Artis Senger Press sub
#> 3 Tal Rogahn Town planner
#> 4 Nikolas Carter Barrister's clerk
#> 5 Sharlene Kemmer Insurance account manager
#> 6 Babyboy Volkman Quality manager
#> 7 Dr. Josephus Marquardt DVM Best boy
#> 8 Vernal Dare Engineer, site
#> 9 Emilia Hessel Administrator, arts
#> 10 Urijah Beatty Editor, commissioning
#> # ... with 1 more variables: phone_number <chr>
You can select just the variables you want:
ch_generate('job', 'phone_number', n = 30)
#> # A tibble: 30 x 2
#> job phone_number
#> <chr> <chr>
#> 1 Call centre manager 1-670-715-3079x9104
#> 2 Nurse, learning disability 1-502-781-3386x33524
#> 3 Network engineer 1-692-089-3060
#> 4 Industrial buyer 1-517-855-8517
#> 5 Database administrator (999)474-9975x89650
#> 6 Operations geologist 06150655769
#> 7 Engineer, land 360-043-3630x592
#> 8 Pension scheme manager (374)429-6821
#> 9 Personnel officer 1-189-574-3348x338
#> 10 Editor, film/video 1-698-135-1664
#> # ... with 20 more rows
A sampling of the data types available in charlatan:
person name
ch_name()
#> [1] "Jefferey West-O'Connell"
ch_name(10)
#> [1] "Dylon Hintz" "Dr. Billy Willms DDS" "Captain Bednar III"
#> [4] "Carli Torp" "Price Strosin III" "Grady Mayert"
#> [7] "Nat Herman-Kuvalis" "Noelle Funk" "Dr. Jaycie Herzog MD"
#> [10] "Ms. Andrea Zemlak"
phone number
ch_phone_number()
#> [1] "643.993.1958"
ch_phone_number(10)
#> [1] "+06(6)6080789632" "05108334280" "447-126-9775"
#> [4] "+96(7)2112213020" "495-425-1506" "1-210-372-3188x514"
#> [7] "(300)951-5115" "680.567.5321" "1-947-805-4758x8167"
#> [10] "888-998-5511x554"
job
ch_job()
#> [1] "Scientist, water quality"
ch_job(10)
#> [1] "Engineer, production"
#> [2] "Architect"
#> [3] "Exhibitions officer, museum/gallery"
#> [4] "Patent attorney"
#> [5] "Surveyor, minerals"
#> [6] "Electronics engineer"
#> [7] "Secondary school teacher"
#> [8] "Intelligence analyst"
#> [9] "Nutritional therapist"
#> [10] "Information officer"
Real data is messy! charlatan
makes it easy to create
messy data. This is still in the early stages so is not available
across most data types and languages, but we're working on it.
For example, create messy names:
ch_name(50, messy = TRUE)
#> [1] "Mr. Vernell Hoppe Jr." "Annika Considine d.d.s."
#> [3] "Dr. Jose Kunde DDS" "Karol Leuschke-Runte"
#> [5] "Kayleen Kutch-Hintz" "Jahir Green"
#> [7] "Stuart Emmerich" "Hillard Schaden"
#> [9] "Mr. Caden Braun" "Willie Ebert"
#> [11] "Meg Abbott PhD" "Dr Rahn Huel"
#> [13] "Kristina Crooks d.d.s." "Lizbeth Hansen"
#> [15] "Mrs. Peyton Kuhn" "Hayley Bernier"
#> [17] "Dr. Lavon Schimmel d.d.s." "Iridian Murray"
#> [19] "Cary Romaguera" "Tristan Windler"
#> [21] "Marlana Schroeder md" "Mr. Treyton Nitzsche"
#> [23] "Hilmer Nitzsche-Glover" "Marius Dietrich md"
#> [25] "Len Mertz" "Mrs Adyson Wunsch DVM"
#> [27] "Dr. Clytie Feest DDS" "Mr. Wong Lebsack I"
#> [29] "Arland Kessler" "Mrs Billy O'Connell m.d."
#> [31] "Stephen Gerlach" "Jolette Lueilwitz"
#> [33] "Mrs Torie Green d.d.s." "Mona Denesik"
#> [35] "Mitchell Auer" "Miss. Fae Price m.d."
#> [37] "Todd Lehner" "Elva Lesch"
#> [39] "Miss. Gustie Rempel DVM" "Lexie Parisian-Stark"
#> [41] "Beaulah Cremin-Rice" "Parrish Schinner"
#> [43] "Latrell Beier" "Garry Wolff Sr"
#> [45] "Bernhard Vandervort" "Stevie Johnston"
#> [47] "Dawson Gaylord" "Ivie Labadie"
#> [49] "Ronal Parker" "Mr Willy O'Conner Sr."
Right now only suffixes and prefixes for names in the en_US locale are supported. Notice above some variation in prefixes and suffixes.
We have lots to do still. Some of those things include:
- More locales: for many variables, faker has the data, but we need to port it over still.
- More variables ported over from faker.
In addition, we may find inspiration from faker libraries in other
programming languages.
A new minor version 0.2.3 of RcppCCTZ is now on CRAN.
RcppCCTZ uses Rcpp to bring CCTZ to R. CCTZ is a C++ library for translating between absolute and civil times using the rules of a time zone. In fact, it is two libraries: one for dealing with civil time (human-readable dates and times), and one for converting between absolute and civil times via time zones. The RcppCCTZ page has a few usage examples and details.
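To make the civil-versus-absolute distinction concrete, here is a small base-R illustration (plain base R, not RcppCCTZ itself): a single absolute instant has different civil representations depending on the time zone applied.

```r
# One absolute instant: noon UTC on the release date
t <- as.POSIXct("2017-06-19 12:00:00", tz = "UTC")

# The same instant rendered as civil time in two zones
format(t, "%Y-%m-%d %H:%M", tz = "America/New_York")  # "2017-06-19 08:00"
format(t, "%Y-%m-%d %H:%M", tz = "Asia/Tokyo")        # "2017-06-19 21:00"
```

The instant itself never changes; only the time-zone rules used to translate it to a human-readable date and time do, which is exactly the conversion CCTZ's second library handles.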
This version ensures that we set the TZDIR
environment variable correctly on the old dreaded OS that does not come with proper timezone information—an issue which had come up while preparing the next (and awesome, trust me) release of nanotime. It also appears that I failed to blog about 0.2.2, another maintenance release, so changes for both are summarised next.
Changes in version 0.2.3 (2017-06-19)
- On Windows, the TZDIR environment variable is now set in .onLoad()
- Replaced init.c with registration code inside of RcppExports.cpp thanks to Rcpp 0.12.11

Changes in version 0.2.2 (2017-04-20)
- Synchronized with upstream CCTZ
- The time_point object is instantiated explicitly for nanosecond use, which appears to be required on macOS
We also have a diff to the previous version thanks to CRANberries. More details are at the RcppCCTZ page; code, issue tickets etc at the GitHub repository.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
The post Large integers in R: Fibonacci number with 1000 digits, Euler Problem 25 appeared first on The Devil is in the Data.
Euler Problem 25 takes us back to the Fibonacci sequence and the problems related to working with very large integers.
The Fibonacci sequence follows a simple mathematical rule but it can create things of great beauty. This pattern occurs quite often in nature, like the nautilus shell shown in the image. The video by Arthur Benjamin at the end of this post illustrates some of the magic of this sequence.
By default, numbers with more than 7 digits are shown in scientific notation in R, which reduces the accuracy of the calculation. You can change the precision of large integers with the options function but R struggles with integers with more than 22 digits. This example illustrates this issue.
> factorial(24)
[1] 6.204484e+23
> options(digits = 22)
> factorial(24)
[1] 620448401733239409999872
However, finding a number of 1000 digits is a problem without using special functions.
The Fibonacci sequence is defined by the recurrence relation:
F_n = F_{n-1} + F_{n-2}, where F_1 = 1 and F_2 = 1. The 12th term, F_12 = 144, is the first term to contain three digits.
What is the index of the first term in the Fibonacci sequence to contain 1000 digits?
The first solution uses the GMP library to manage very large integers. This library also contains a function to generate Fibonacci numbers. This solution cycles through the Fibonacci sequence until it finds a number with 1000 digits.
library(gmp)  # GNU Multiple Precision Arithmetic Library
n <- 1
fib <- 1
while (nchar(as.character(fib)) < 1000) {
  fib <- fibnum(n)  # Determine next Fibonacci number
  n <- n + 1
}
answer <- n
print(answer)
This is a very fast solution but my aim is to solve the first 100 Project Euler problems using only base-R code. The second solution uses the big.add function I developed to solve Euler Problem 13.
t <- proc.time()
fib <- 1    # First Fibonacci number
cur <- 1    # Current number in sequence
pre <- 1    # Previous number in sequence
index <- 2
while (nchar(fib) < 1000) {
  fib <- big.add(cur, pre)  # Determine next Fibonacci number
  pre <- cur
  cur <- fib
  index <- index + 1
}
answer <- index
print(answer)
This code is much slower than the GMP library but it was fun to develop.
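The big.add function itself is not listed above; as a rough sketch, a digit-string adder along these lines would work (my own reconstruction for illustration, which may differ from the original Euler Problem 13 version):

```r
# Add two non-negative integers given as decimal strings,
# avoiding R's floating-point limits entirely.
big.add <- function(a, b) {
  # Split into single digits, least significant digit first
  x <- rev(as.numeric(strsplit(a, "")[[1]]))
  y <- rev(as.numeric(strsplit(b, "")[[1]]))
  # Pad the shorter number with zeros
  n <- max(length(x), length(y))
  x <- c(x, rep(0, n - length(x)))
  y <- c(y, rep(0, n - length(y)))
  # Schoolbook addition with carry propagation
  s <- x + y
  carry <- 0
  for (i in seq_len(n)) {
    s[i] <- s[i] + carry
    carry <- s[i] %/% 10
    s[i] <- s[i] %% 10
  }
  if (carry > 0) s <- c(s, carry)
  paste(rev(s), collapse = "")
}

big.add("999", "1")  # "1000"
```

Since the result is returned as a string, nchar() on it directly gives the digit count used in the loop condition above.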
The Windows edition of the Data Science Virtual Machine (DSVM), the all-in-one virtual machine image with a wide-collection of open-source and Microsoft data science tools, has been updated to the Windows Server 2016 platform. This update brings built-in support for Docker containers and GPU-based deep learning.
GPU-based Deep Learning. While prior editions of the DSVM could access GPU-based capabilities by installing additional components, everything is now configured and ready at launch. The DSVM now includes GPU-enabled builds of popular deep learning frameworks including CNTK, Tensorflow, and MXNET. It also includes Microsoft R Server 9.1, and several machine-learning functions in the MicrosoftML package can also take advantage of GPUs. Note that you will need to use an N-series Azure instance to benefit from GPU acceleration, but all of the tools in the DSVM will also work on regular CPU-based instances.
Docker Containers. Windows Server 2016 includes Windows Containers, and the DSVM also includes the Docker Engine. This means you can easily run Windows containers from Docker Hub or the Azure Container registry. This is limited to Windows-based containers right now, but you can use Linux-based containers on the Data Science Virtual Machine for Ubuntu Linux, which likewise supports GPU-enabled deep learning.
It's easy to launch a Windows 2016 DSVM on Azure, and the documentation will help you get started using it. (An Azure account is required, but free Azure credits are available to help get you started.) You can find more information about this update at the link below.
Cortana Intelligence and Machine Learning Blog: Introducing the new Data Science Virtual Machine on Windows Server 2016
Neural networks have become a cornerstone of machine learning in the last decade. Created in the late 1940s with the intention of building computer programs that mimic the way neurons process information, these kinds of algorithms were long believed to be only an academic curiosity, deprived of practical use, since they require a lot of processing power and other machine learning algorithms outperformed them. However, since the mid 2000s, the creation of new neural network types and techniques, coupled with the increased availability of fast computers, has made the neural network a powerful tool that every data analyst or programmer must know.
In this series of articles, we’ll see how to fit a neural network with R, learn the core concepts we need to apply those algorithms well, and see how to evaluate whether our model is appropriate for use in production. In the last exercise set, we saw how to implement a feed-forward neural network in R. That kind of neural network is quite useful for matching a single input value to a specific output value, either a dependent variable in regression problems or a class in classification problems. However, sometimes a sequence of inputs can give the network a lot more information than a single value. For example, if you want to train a neural network to predict which letter will come next in a word based on which letters have been typed, making predictions based on the last letter entered can give good results, but if all the previous letters are used for making the predictions, the results should be better, since the arrangement of the previous letters can give important information about the rest of the word.
In today’s exercise set, we will see a type of neural network that is designed to make use of the information made available by using sequences of inputs. These “recurrent neural networks” do so by using a hidden state at time t-1 that influences the calculation of the weights at time t. For more information about this type of neural network, you can read this article, which is a good introduction to the subject.
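To fix ideas, the core recurrent step can be written in a few lines of R. This is a toy illustration with made-up dimensions and random weights, not the internals of any package: the point is only that the hidden state computed at time t-1 feeds into the computation at time t.

```r
set.seed(1)
# Toy recurrent step: h_t = tanh(Wx * x_t + Wh * h_{t-1} + b)
Wx <- matrix(rnorm(3), nrow = 3)      # input-to-hidden weights (hidden size 3)
Wh <- matrix(rnorm(9), nrow = 3)      # hidden-to-hidden weights
b  <- rep(0, 3)                       # bias
h  <- rep(0, 3)                       # initial hidden state
x  <- c(0.2, 0.5, 0.1)                # a short input sequence

for (t in seq_along(x)) {
  # The previous hidden state h enters the update, carrying
  # information about earlier inputs forward through time
  h <- tanh(Wx %*% x[t] + Wh %*% h + b)
}
h  # final hidden state summarizes the whole sequence
```

Because h is reused at every step, the final state depends on the whole sequence of inputs, which is what lets a recurrent network exploit the letter-ordering information described above.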
Answers to the exercises are available here.
Exercise 1
We will start by using a recurrent neural network to predict the values of a time series. Load the EuStockMarkets dataset from the datasets package and save the first 1400 observations from the “DAX” time series as your working dataset.
Exercise 2
Process the dataset so it can be used in a neural network.
Exercise 3
Create two matrices containing 10 sequences of 140 observations from the previous dataset. The first one must be made of the original observations and will be the input of our neural network. The second one will be the output: since we want to predict the value of the stock market at time t+1 based on the value at time t, this matrix will be the same as the first one with all the elements shifted by one position. Make sure that each sequence is coded as a row of each matrix.
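As a hint, the reshaping might look something like the following sketch (the variable names and the stand-in series are my own; use your processed DAX values in place of the synthetic one):

```r
# Stand-in for the 1400 scaled DAX observations
prices <- sin(seq_len(1400) / 50)

# Input: 10 sequences of 140 observations, one sequence per row
X <- matrix(prices, nrow = 10, byrow = TRUE)

# Output: the same series shifted by one position, so each input
# value at time t is paired with the value at time t+1
shifted <- c(prices[-1], prices[1400])
Y <- matrix(shifted, nrow = 10, byrow = TRUE)

dim(X)  # 10 140
```

Note that byrow = TRUE is what guarantees each sequence ends up as a row rather than a column.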
Exercise 4
Set the seed to 42 and randomly choose eight sequences to train your model and two sequences that will be used for validation later. Once that’s done, load the rnn package and use the trainr() function to train a recurrent neural network on the training dataset. For now, use a learning rate of 0.01, one hidden layer of one neuron and 500 epochs.
Exercise 5
Use the function predictr() to make predictions on all 10 sequences of your original data matrix, then plot the real values and the predicted values on the same graph. Also draw the plot of the predictions on the test set and the real values of your dataset.
Exercise 6
The last model seems to underestimate the stock values that are higher than 0.5. Repeat the steps of exercises 3 and 4, but this time use 10 neurons in the hidden layer. Once it’s done, calculate the RMSE of your predictions. This will be the baseline model for the rest of this exercise set.
Exercise 7
One interesting method often used to accelerate the training of a neural network is “Nesterov momentum”. This procedure is based on the fact that, while trying to find the weights that minimize the cost function of your neural network, optimization algorithms like gradient descent “zigzag” around a straight path to the minimum value. By adding a momentum matrix, which keeps track of the general direction of the gradient, we can minimize the deviation from this optimal path and speed up the convergence of the algorithm. You can see this video for more information about this concept.
Repeat the last exercise, but this time use 250 epochs and a momentum of 0.7.
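As a rough illustration of the momentum idea, here is classical momentum on a toy one-dimensional function (a simplified sketch: not Nesterov's exact formulation, and not how the rnn package implements it internally):

```r
# Minimize f(x) = x^2 with gradient descent plus momentum
grad <- function(x) 2 * x

x <- 5          # starting point
v <- 0          # velocity: running direction of past gradients
lr <- 0.1       # learning rate
momentum <- 0.7 # same momentum value as in the exercise

for (i in 1:100) {
  v <- momentum * v - lr * grad(x)  # accumulate the gradient direction
  x <- x + v                        # step along the smoothed direction
}
x  # ends up very close to the minimum at 0
```

The velocity term damps the step-to-step zigzag, so the iterates follow a straighter path to the minimum than plain gradient descent with the same learning rate.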
Exercise 8
A special type of recurrent neural network trained by backpropagation through time is the Long Short-Term Memory (LSTM) network. This type of recurrent neural network is quite useful in a deep learning context, since this method is robust against the vanishing gradient problem. We will see both of those concepts in more detail in a future exercise set, but for now you can read about them here.
The trainr() function gives us the ability to train an LSTM network by setting the network_type parameter to “lstm”. Use this algorithm with 500 epochs and 20 neurons in the hidden layer to predict the value of your time series.
Exercise 9
When working with a recurrent neural network, it is important to choose an input sequence length that gives the algorithm the maximum information possible without adding useless noise to the input. Until now we have used 10 sequences of 140 observations. Train a recurrent neural network on 28 sequences of 50 observations, make predictions and compute the RMSE to see if this encoding had an effect on your predictions.
Exercise 10
Try to use all of the 1860 observations in the “DAX” time series to train and test a recurrent neural network. Then post the settings you used for your model, and why you chose them, in the comments.