(This article was first published on **S+/R – Yet Another Blog in Statistical Computing**, and kindly contributed to R-bloggers)

In R, there are a couple of ways to convert a column-oriented data frame into a row-oriented dictionary or a similar structure, e.g. a list of lists.

In the code snippet below, I show each approach and how to extract keys and values from the dictionary. As the benchmark shows, the generic R data structure is still the most efficient.

```r
### LIST() FUNCTION IN BASE PACKAGE ###
x1 <- as.list(iris[1, ])
names(x1)
# [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
x1[["Sepal.Length"]]
# [1] 5.1

### ENVIRONMENT-BASED SOLUTION ###
envn_dict <- function(x) {
  e <- new.env(hash = TRUE)
  for (name in names(x)) assign(name, x[, name], e)
  return(e)
}

x2 <- envn_dict(iris[1, ])
ls(x2)
# [1] "Petal.Length" "Petal.Width"  "Sepal.Length" "Sepal.Width"  "Species"
x2[["Sepal.Length"]]
# [1] 5.1

### COLLECTIONS PACKAGE ###
coll_dict <- function(x) {
  d <- collections::Dict$new()
  for (name in names(x)) d$set(name, x[, name])
  return(d)
}

x3 <- coll_dict(iris[1, ])
x3$keys()
# [1] "Petal.Length" "Petal.Width"  "Sepal.Length" "Sepal.Width"  "Species"
x3$get("Sepal.Length")
# [1] 5.1

### HASH PACKAGE ###
hash_dict <- function(x) {
  d <- hash::hash()
  for (name in names(x)) d[[name]] <- x[, name]
  return(d)
}

x4 <- hash_dict(iris[1, ])
hash::keys(x4)
# [1] "Petal.Length" "Petal.Width"  "Sepal.Length" "Sepal.Width"  "Species"
hash::values(x4, "Sepal.Length")
# Sepal.Length
#          5.1

### DATASTRUCTURES PACKAGE ###
data_dict <- function(x) {
  d <- datastructures::hashmap()
  for (name in names(x)) d[name] <- x[, name]
  return(d)
}

x5 <- data_dict(iris[1, ])
datastructures::keys(x5)
# [1] "Species" "Sepal.Width" "Petal.Length" "Sepal.Length" "Petal.Width"
datastructures::get(x5, "Sepal.Length")
# [1] 5.1

### FROM PYTHON ###
py2r_dict <- function(x) {
  return(reticulate::py_dict(names(x), x, TRUE))
}

x6 <- py2r_dict(iris[1, ])
x6$keys()
# [1] "Petal.Length" "Sepal.Length" "Petal.Width" "Sepal.Width" "Species"
x6["Sepal.Length"]
# [1] 5.1

### CONVERT DATAFRAME TO DICTIONARY LIST ###
to_list <- function(df, fn) {
  l <- list()
  for (i in seq(nrow(df))) l[[i]] <- fn(df[i, ])
  return(l)
}

rbenchmark::benchmark(replications = 100, order = "elapsed", relative = "elapsed",
                      columns = c("test", "replications", "elapsed",
                                  "relative", "user.self", "sys.self"),
  "BASE::LIST"              = to_list(iris, as.list),
  "BASE::ENVIRONMENT"       = to_list(iris, envn_dict),
  "COLLECTIONS::DICT"       = to_list(iris, coll_dict),
  "HASH::HASH"              = to_list(iris, hash_dict),
  "DATASTRUCTURES::HASHMAP" = to_list(iris, data_dict),
  "RETICULATE::PY_DICT"     = to_list(iris, py2r_dict)
)
#                      test replications elapsed relative user.self sys.self
# 1              BASE::LIST          100   0.857    1.000     0.857    0.000
# 2       BASE::ENVIRONMENT          100   1.607    1.875     1.607    0.000
# 4              HASH::HASH          100   2.600    3.034     2.600    0.000
# 3       COLLECTIONS::DICT          100   2.956    3.449     2.956    0.000
# 5 DATASTRUCTURES::HASHMAP          100  16.070   18.751    16.071    0.000
# 6     RETICULATE::PY_DICT          100  18.030   21.039    18.023    0.008
```
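One more base-R idiom worth knowing, though it was not part of the benchmark above: `split()` by row index turns a data frame into a list of one-row data frames, which also behaves like a row-oriented collection.

```r
# Not benchmarked above: split() a data frame into a list of one-row data frames
x7 <- split(iris[1:2, ], seq_len(2))
names(x7[[1]])            # column names are preserved in each element
x7[[1]][["Sepal.Length"]]
# [1] 5.1
```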

To **leave a comment** for the author, please follow the link and comment on their blog: ** S+/R – Yet Another Blog in Statistical Computing**.


(This article was first published on **Thinking inside the box**, and kindly contributed to R-bloggers)

A second and minor update for the RcppGetconf package for reading system configuration — not unlike `getconf` from the libc library — is now on CRAN.

Changes are minor. We avoid an error on a long-dead operating system cherished in one particular corner of the CRAN world. In doing so some files were updated so that dynamically loaded routines are now registered too.

The short list of changes in this release follows:

## Changes in RcppGetconf version 0.0.3 (2018-11-16)

- Examples no longer run on Solaris where they appear to fail.

Courtesy of CRANberries, there is a diffstat report. More about the package is at the local RcppGetconf page and the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


(This article was first published on **Mad (Data) Scientist**, and kindly contributed to R-bloggers)

I occasionally see queries on various social media as to overfitting — what is it?, etc. I’ll post an example here. (I mentioned it at my talk the other night on our novel approach to missing values, but had a bug in the code. Here is the correct account.)

The dataset is **prgeng**, on wages of programmers and engineers in Silicon Valley as of the 2000 Census. It’s included in our **polyreg** package, which we developed as an alternative to neural networks. But it is quite useful in its own right, as it makes it very convenient to fit multivariate polynomial models. (Not as easy as it sounds; e.g. one must avoid cross products of orthogonal dummy variables, powers of those variables, etc.)

First I split the data into training and test sets:

```r
> set.seed(88888)
> getPE()
> pe1 <- pe[, c(1, 2, 4, 6, 7, 12:16, 3)]
> testidxs <- sample(1:nrow(pe1), 1000)
> testset <- pe1[testidxs, ]
> trainset <- pe1[-testidxs, ]
```

As a baseline, I fit an ordinary degree-1 model and found the mean absolute prediction error:

```r
> lmout <- lm(wageinc ~ ., data = trainset)
> predvals <- predict(lmout, testset[, -11])
> mean(abs(predvals - testset[, 11]))
[1] 25456.98
```

Next, I tried a quadratic model:

```r
> pfout <- polyFit(trainset, deg = 2)
> mean(abs(predict(pfout, testset[, -11]) - testset[, 11]))
[1] 24249.83
```

Note that, though originally there were 10 predictors, now with polynomial terms we have 46.

I kept going:

| deg | MAPE | # terms |
|-----|----------|------|
| 3 | 23951.69 | 118 |
| 4 | 23974.76 | 226 |
| 5 | 24340.85 | 371 |
| 6 | 24554.34 | 551 |
| 7 | 36463.61 | 767 |
| 8 | 74296.09 | 1019 |

One must keep in mind the effect of sampling variation, and repeated trials would be useful here, but it seems that the data can stand at least a cubic fit and possibly as much as degree 5 or even 6. To be conservative, it would seem wise to stop at degree 3. That’s also consistent with the old Tukey rule of thumb that we should have p ≤ sqrt(n), n being about 20,000 here.
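As a quick check of that rule of thumb (my own arithmetic, not from the original post): with n ≈ 20,000, sqrt(n) ≈ 141, so the 118 terms of the cubic fit sit under the bound while degree 4, at 226 terms, already exceeds it.

```r
# Rule-of-thumb check: keep the number of terms p at most about sqrt(n)
n <- 20000
sqrt(n)        # ~141.4
118 <= sqrt(n) # degree 3: TRUE, within the bound
226 <= sqrt(n) # degree 4: FALSE, over the bound
```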

In any event, the effects of overfitting here are dramatic, starting at degree 7.

It should be noted that I made no attempt to clean the data, nor to truncate predicted values at 0, etc.

It should also be noted that, starting at degree 4, R emitted warnings, “prediction from a rank-deficient fit may be misleading.” It is well known that at high degrees, polynomial terms can have multicollinearity issues.

Indeed, this is a major point we make in our arXiv paper cited above. There we argue that neural networks are polynomial models in disguise, with the effective degree of the polynomial increasing a lot at each successive layer, and thus multicollinearity increasing from layer to layer. We’ve confirmed this empirically. We surmise that this is a major cause of convergence problems in NNs.

Finally, whenever I talk about polynomials and NNs, I hear the standard (and correct) concern that polynomials grow rapidly at the edges of the data. True, but I would point out that if you accept NNs = polynomials, then the same is true for NNs.
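A tiny illustration of that edge behaviour (my own toy example, not from the post): fit a high-degree polynomial on data confined to [-1, 1], then evaluate it just outside that range.

```r
# High-degree polynomial fit on [-1, 1], then extrapolation at x = 1.5
set.seed(1)
x <- seq(-1, 1, length.out = 50)
y <- sin(2 * x) + rnorm(50, sd = 0.1)
fit <- lm(y ~ poly(x, 8, raw = TRUE))
predict(fit, newdata = data.frame(x = 1.5)) # typically far from sin(3) ~ 0.14
```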

We’re still working on the polynomials/NNs project. More developments to be announced soon. But for those who are wondering about overfitting, the data here should make the point.


(This article was first published on **poissonisfish**, and kindly contributed to R-bloggers)

Among the most popular off-the-shelf machine learning packages available to R, `caret` ought to stand out for its consistency. It reaches out to a wide range of dependencies that deploy and support model building using a uniform, simple syntax. I have been using `caret` extensively for the past three years, with a precious partial least squares (PLS) tutorial in my track record.

A couple of years ago, the creator and maintainer of `caret`, Max Kuhn, joined RStudio, where he has been contributing new packages to the ongoing tidy-paranoia – the supporting `recipes`, `yardstick`, `rsample` and many other packages that are part of the `tidyverse` paradigm and that I knew little about. As it happens, `caret` is now best used with some of these. As an aspiring data scientist with fashionable hex-stickers on my laptop cover and a tendency to start any sentence with ‘big data’, I set out to learn `tidyverse` and went Super Mario using pipes (`%>%`, Ctrl + Shift + M).

Overall, I found the ‘tidy’ approach quite enjoyable and efficient. Besides streamlining model design in a sequential manner, `recipes` and the accompanying ‘tidy’ packages fix a common problem in the statistical learning realm that curtails predictive accuracy: data leakage. You can now generate a ‘recipe’ that lays out the plan for all the feature engineering (or data pre-processing, depending on how many hex-stickers you have) and execute it separately for validation and resampling splits (i.e. split first, process later), thus providing a more robust validation scheme. It also enables recycling any previous computations afterwards.
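To make the data-leakage point concrete, here is a minimal base-R sketch (hypothetical variable names, mine) of the ‘split first, process later’ principle that a prepped recipe enforces:

```r
# Centering/scaling statistics must come from the training rows only;
# the held-out rows are then transformed with those same statistics.
set.seed(1)
x <- rnorm(100)
train <- x[1:80]
test  <- x[81:100]
mu <- mean(train) # estimated on training data only
s  <- sd(train)
test_scaled <- (test - mu) / s # no information flows from test into mu or s
```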

I drew a lot of inspiration from a webinar and a presentation from Max himself to come up with this tutorial. Max is now hands-on writing a new book, entitled *Feature Engineering and Selection: A Practical Approach for Predictive Models*, that will include applications of `caret` supported by `recipes`. I very much look forward to reading it, considering he did a fantastic job with *Applied Predictive Modeling*.

The aim here is to test this new `caret` framework on the DREAM Olfaction Prediction Challenge, a crowd-sourced initiative launched in January 2015. This challenge demonstrated that predictive models can identify perceptual attributes (e.g. coffee-like, spicy and sweet notes) in a collection of fragrant compounds from physicochemical features (e.g. atom types). Results and conclusions were condensed into a research article [1] that reports an impressive predictive accuracy for a large number of models and teams, over various perceptive attributes. Perhaps more impressive yet is the fact that it anticipates the very scarce knowledge of human odour receptors and their molecular activation mechanisms. Our task, more concretely, will be to predict population-level pleasantness of hundreds of compounds from physicochemical features, using the training set of the competition.

Finally, a short announcement. Many people have been raising reproducibility issues with previous entries, due to either package archiving (e.g. `GenABEL`, for genetic mapping) or cross-incompatibility (e.g. `caret`, for PLS). There were easy fixes for the most part, but I have nevertheless decided to start packing scripts together with `sessionInfo()` reports in unique GitHub repositories. The bundle for this tutorial is available under https://github.com/monogenea/olfactionPrediction. I hope this will help track down the exact package versions and system configurations I use, so that anyone can anytime fully reproduce these results. No hex-stickers required!

**NOTE that most code chunks containing pipes are corrupted. PLEASE refer to the materials from the repo.**

The DREAM Olfaction Prediction Challenge training set consists of 338 compounds characterised by 4,884 physicochemical features (the design matrix), and profiled by 49 subjects with respect to 19 semantic descriptors, such as ‘flower’, ‘sweet’ and ‘garlic’, together with intensity and pleasantness (all perceptual attributes). Two different dilutions were used per compound. The perceptual attributes were given scores in the 0-100 range. The two major goals of the competition were to use the physicochemical features in modelling *i*) the perceptual attributes at the subject-level and *ii*) both the mean and standard deviation of the perceptual attributes at the population-level. How ingenious it was to model the standard deviation, the variability in how something tastes to different people! In the end, the organisers garnered important insights about the biochemical basis of flavour, from teams all over the world. The resulting models were additionally used to reverse-engineer nonexistent compounds from target sensory profiles – an economically exciting prospect.

According to the publication, the top-five best predicted perceptual attributes at the population-level (i.e. mean values across all 49 subjects) were ‘garlic’, ‘fish’, ‘intensity’, pleasantness and ‘sweet’ (cf. Fig. 3A in [1]), all exhibiting average prediction correlations of , far greater than the average subject-level predictions (cf. Fig. 2D in [1]). This should come as no surprise, since averaging over subjects is more likely to flesh out universal mechanisms. This averaging also helps stabilise the subject-level variance – no subject will necessarily utilise the full 0-100 scoring range; some will tend to use narrow intervals (low variance) while others will use the whole range instead (high variance).

Since pleasantness is one of the most general olfactory properties, I propose modelling the population-level median pleasantness of all compounds, from all physicochemical features. The median, as opposed to the mean, is less sensitive to outliers and an interesting departure from the original approach we can later compare against.
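A one-liner showing why the median is the more outlier-resistant choice (my own toy scores on the 0-100 scale, not data from the challenge):

```r
# One enthusiastic scorer drags the mean, but barely moves the median
scores <- c(40, 42, 45, 47, 100)
mean(scores)   # 54.8 -- pulled up by the outlier
median(scores) # 45
```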

We can start off by loading all the necessary packages and pulling the data directly from the DREAM Olfaction Prediction GitHub repository. More specifically, we will create a directory called `data` and import both the profiled perceptual attributes (`TrainSet.txt`) and the physicochemical features that comprise our design matrix (`molecular_descriptors_data.txt`). Because we are tidy people, we will use `readr` to parse these as tab-separated values (TSV). I also suggest re-writing the column name `Compound Identifier` in the sensory profiles into `CID`, to match with that from the design matrix `molFeats`.

```r
# Mon Oct 29 13:17:55 2018 ------------------------------
## DREAM olfaction prediction challenge
library(caret)
library(rsample)
library(tidyverse)
library(recipes)
library(magrittr)
library(doMC)

# Create directory and download files
dir.create("data/")
ghurl <- "https://github.com/dream-olfaction/olfaction-prediction/raw/master/data/"
download.file(paste0(ghurl, "TrainSet.txt"), destfile = "data/trainSet.txt")
download.file(paste0(ghurl, "molecular_descriptors_data.txt"),
              destfile = "data/molFeats.txt")

# Read files with readr, select least diluted compounds
responses <- read_tsv("data/trainSet.txt") %>%
  rename(CID = `Compound Identifier`)
molFeats <- read_tsv("data/molFeats.txt") # molecular features
```

Next, we filter compounds that are common to both datasets and reorder them accordingly. The profiled perceptual attributes span a total of 49 subjects, two different dilutions and some replications (cf. Fig. 1A in [1]). Determine the median pleasantness (i.e. `VALENCE/PLEASANTNESS`) per compound, across all subjects and dilutions, while ignoring missingness with `na.rm = T`. The last line will ensure the order of the compounds is identical between this new dataset and the design matrix, so that we can subsequently introduce population-level pleasantness as a new column termed `Y` in the new design matrix `X`. We no longer need `CID`, so we can discard it.

Regarding missingness, I found maxima of 0.006% and 23% NA content over all columns and rows, respectively. A couple of compounds could be excluded, but I will move on.
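The NA maxima quoted above can be computed along these lines (a sketch on a toy data frame; swap in the design matrix built earlier):

```r
# Per-column and per-row NA percentages
toy <- data.frame(a = c(1, NA, 3), b = c(4, 5, 6))
colNA <- colMeans(is.na(toy)) * 100 # per-column NA percentage
rowNA <- rowMeans(is.na(toy)) * 100 # per-row NA percentage
max(colNA) # worst column
max(rowNA) # worst row
```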

```r
# Determine intersection of compounds in features and responses
commonMols <- intersect(responses$CID, molFeats$CID)
responses <- filter(responses, CID %in% commonMols)
molFeats <- filter(molFeats, CID %in% commonMols)

# Median pleasantness per compound, across subjects and dilutions
medianPlsnt <- responses %>% group_by(CID) %>%
  dplyr::summarise(pleasantness = median(`VALENCE/PLEASANTNESS`, na.rm = T))
all(medianPlsnt$CID == molFeats$CID) # TRUE - rownames match

# Concatenate predictors (molFeats) and population pleasantness
X <- mutate(molFeats, Y = medianPlsnt$pleasantness) %>% select(-CID)
```

In examining the structure of the design matrix, you will see there are many skewed binary variables. In the most extreme case, if a binary predictor is all zeros or all ones throughout, it is said to be a zero-variance predictor that, if anything, will harm the model training process. There is still the risk of having near-zero-variance predictors, which can be controlled for based on various criteria (e.g. number of samples, size of splits). Of course, this can also impact quantitative, continuous variables. Here we will use `nearZeroVar` from `caret` to identify faulty predictors that are either zero-variance or display values whose frequency exceeds a predefined threshold. Having a 5x repeated 5-fold cross-validation in mind, I suggest setting `freqCut = 4`, which will rule out *i*) binary variables with , and *ii*) continuous variables with . See `?nearZeroVar` for more information. From 4,870 features we are left with 2,554 – a massive reduction by almost half.

```r
# Filter nzv
X <- X[, -nearZeroVar(X, freqCut = 4)] # == 80/20
```
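To see what `freqCut = 4` flags, consider a toy example (mine, not from the tutorial): a predictor whose most common value is more than four times as frequent as the runner-up is removed, provided it also has few unique values.

```r
library(caret) # for nearZeroVar
toy <- data.frame(ok     = rep(0:1, each = 50),       # 50/50 split: kept
                  skewed = c(rep(0, 85), rep(1, 15))) # 85/15 > 4:1: flagged
nearZeroVar(toy, freqCut = 4) # returns the index of `skewed`
```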

Now we will use `rsample` to define the train-test and cross-validation splits. Please use `set.seed` as it is, to ensure you generate the same partitions and arrive at the same results. Here I have allocated 10% of the data to the test set; by additionally requesting stratification based on the target `Y`, we are sure to have a representative subset. We can next pass the new objects to `training` and `testing` to effectively split the design matrix into train and test sets, respectively. The following steps consist of setting up a 5x repeated 5-fold cross-validation for the training set. Use `vfold_cv` and convert the output to a `caret`-like object via `rsample2caret`. Next, initialise a `trainControl` object and overwrite `index` and `indexOut` using the corresponding elements in the newly converted `vfold_cv` output.

```r
# Split train/test with rsample
set.seed(100)
initSplit <- initial_split(X, prop = .9, strata = "Y")
trainSet <- training(initSplit)
testSet <- testing(initSplit)

# Create 5x repeated 5-fold cross-validation, convert to caret class
set.seed(100)
myFolds <- vfold_cv(trainSet, v = 5, repeats = 5) %>%
  rsample2caret()
ctrl <- trainControl(method = "cv", selectionFunction = "oneSE")
ctrl$index <- myFolds$index
ctrl$indexOut <- myFolds$indexOut
```

Prior to modelling, I will create two reference variables – `binVars`, to identify binary variables, and `missingVars`, to identify any variables containing missing values. These will help with *i*) excluding binary variables from mean-centering and unit-variance scaling, and *ii*) restricting mean-based imputation to variables that do contain missing values, respectively.

```r
# binary vars
binVars <- which(sapply(X, function(x) { all(x %in% 0:1) }))
missingVars <- which(apply(X, 2, function(k) { any(is.na(k)) }))
```

Recipes are very simple in design. The call to `recipe` is first given the data it is supposed to be applied on, as well as a `lm`-style formula. The `recipes` package contains a family of functions with a `step_...` prefix, involved in encodings, date conversions, filters, imputation, normalisation, dimension reduction and a lot more. These can be linked using pipes, forming a logical sequence of operations that starts with the `recipe` call. Supporting functions such as `all_predictors`, `all_outcomes` and `all_nominal` delineate specific groups of variables at any stage in the sequence. One can also use the names of the variables, akin to my usage of `binVars` and `missingVars`.

I propose writing a basic recipe, `myRec`, that does the following:

- Yeo-Johnson [2] transformation of all quantitative predictors;
- Mean-center all quantitative predictors;
- Unit-variance scale all quantitative predictors;
- Impute missing values with the mean of the incident variables.

```r
# Design recipe
myRec <- recipe(Y ~ ., data = trainSet) %>%
  step_YeoJohnson(all_predictors(), -binVars) %>%
  step_center(all_predictors(), -binVars) %>%
  step_scale(all_predictors(), -binVars) %>%
  step_meanimpute(missingVars)
```

In brief, the Yeo-Johnson procedure transforms variables to be distributed as Gaussian-like as possible. Before delving into the models, let us tweak the recipe and assign it to `pcaRec`, conduct a principal component analysis and examine how different compounds distribute along the first two principal components. Colour them based on their population-level pleasantness – blue for very pleasant, red for very unpleasant and the gradient in between, via `colGrad`.

```r
# simple PCA, plot
pcaRec <- myRec %>% step_pca(all_predictors())
myPCA <- prep(pcaRec, training = trainSet) %>% juice()
colGrad <- trainSet$Y / 100 # add color

plot(myPCA$PC1, myPCA$PC2,
     col = rgb(1 - colGrad, 0, colGrad, .5),
     pch = 16, xlab = "PC1", ylab = "PC2")
legend("topleft", pch = 16,
       col = rgb(c(0, .5, 1), 0, c(1, .5, 0), alpha = .5),
       legend = c("Pleasant", "Neutral", "Unpleasant"), bty = "n")
```

The compounds do not seem to cluster into groups, nor do they clearly separate with respect to pleasantness.

Note that `pcaRec` itself is just a recipe in waiting. Except for recipes passed to `caret::train`, to process and extract the data as instructed you need to either ‘bake’ or ‘juice’ the recipe. The difference between the two is that ‘juicing’ outputs the data associated with the recipe (e.g. the training set), whereas ‘baking’ can be applied to process any other dataset. ‘Baking’ is done with `bake`, provided the recipe was trained using `prep`. I hope this explains why I used `juice` above!
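In code, the juice/bake distinction looks roughly like this (a sketch on `mtcars`, assuming the standard `recipes` API of the time):

```r
library(recipes)
# 'prep' trains the recipe (here: estimates means and sds on mtcars);
# 'juice' returns the processed training data itself;
# 'bake' applies the same trained steps to any dataset.
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())
trained <- prep(rec, training = mtcars)
juiced <- juice(trained)                   # the processed training set
baked  <- bake(trained, new_data = mtcars) # identical result in this case
```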

Next we have the model training. I propose training six regression models – *k*-nearest neighbours (KNN), elastic net (ENET), support vector machine with a radial basis function kernel (SVM), random forests (RF), extreme gradient boosting (XGB) and Quinlan’s Cubist (CUB) – aiming at minimising the root mean squared error (RMSE). Note I am using `tuneGrid` and `tuneLength` interchangeably. I would rather supply predefined hyper-parameters to the simpler models I am more familiar with, and run the rest with a number of defaults via `tuneLength`.

Parallelise the computations if possible. On macOS, I can use `registerDoMC` from the `doMC` package to harness multi-threading of the training procedure. If you are running a different OS, please use a different library if necessary. To the best of my knowledge, `doMC` will also work on Linux, but Windows users might need to use the `doParallel` package instead.

```r
# Train
doMC::registerDoMC(10)
knnMod <- train(myRec, data = trainSet,
                method = "knn",
                tuneGrid = data.frame(k = seq(5, 25, by = 4)),
                trControl = ctrl)
enetMod <- train(myRec, data = trainSet,
                 method = "glmnet",
                 tuneGrid = expand.grid(alpha = seq(0, 1, length.out = 5),
                                        lambda = seq(.5, 2, length.out = 5)),
                 trControl = ctrl)
svmMod <- train(myRec, data = trainSet,
                method = "svmRadial",
                tuneLength = 8, trControl = ctrl)
rfMod <- train(myRec, data = trainSet,
               method = "ranger",
               tuneLength = 8, num.trees = 5000,
               trControl = ctrl)
xgbMod <- train(myRec, data = trainSet,
                method = "xgbTree",
                tuneLength = 8, trControl = ctrl)
cubMod <- train(myRec, data = trainSet,
                method = "cubist",
                tuneLength = 8, trControl = ctrl)

modelList <- list("KNN" = knnMod, "ENET" = enetMod, "SVM" = svmMod,
                  "RF" = rfMod, "XGB" = xgbMod, "CUB" = cubMod)
```

Once the training is over, we can compare the optimal six cross-validated models based on their RMSE across all resamples, using some `bwplot` magic sponsored by `lattice`. In my case the runtime was unusually long (>2 hours) compared to previous releases.

```r
bwplot(resamples(modelList), metric = "RMSE")
```

The cross-validated RMSE does not vary considerably across the six optimal models. To conclude, I propose creating a model ensemble by simply taking the average predictions of all six optimal models on the test set, to compare observed and predicted population-level pleasantness in this hold-out subset of compounds.

```r
# Validate on test set with ensemble
allPreds <- sapply(modelList, predict, newdata = testSet)
ensemblePred <- rowSums(allPreds) / length(modelList)

# Plot predicted vs. observed; create PNG
plot(ensemblePred, testSet$Y,
     xlim = c(0, 100), ylim = c(0, 100),
     xlab = "Predicted", ylab = "Observed",
     pch = 16, col = rgb(0, 0, 0, .25))
abline(a = 0, b = 1)

writeLines(capture.output(sessionInfo()), "sessionInfo")
```

The ensemble fit on the test set is not terrible – slightly underfit, predicting the poorest in the two extremes of the pleasantness range. All in all, it attained a prediction correlation of , which is slightly larger than the mean reported (cf. Fig. 3A in [1]). However, note that there are only 31 compounds in the test set. A model like this could be easily moved into production stage or repurposed to solve a molecular structure from a sensory profile of interest, as mentioned earlier.

The revamped `caret` framework is still under development, but it seems to offer:

- A substantially expanded, unified syntax for models and utils. Keep an eye on `textrecipes`, an upcoming complementary package for processing textual data;
- A sequential streamlining of extraction, processing and modelling, enabling recycling of previous computations;
- Executing-after-splitting, thereby precluding leaky validation strategies.

On the other hand, I hope to see improvements in

- Decoupling from proprietary frameworks in the likes of `tidyverse`, although there are alternatives;
- The runtime. I suspect the recipes themselves are the culprit, not the models. Future updates will certainly fix this.

Regarding the DREAM Olfaction Prediction Challenge, we were able to predict population-level perceived pleasantness from physicochemical features with an accuracy as good as that achieved, on average, by the competing teams. Our model ensemble seems underfit, judging from the narrow range of predicted values. Maybe we could be less restrictive about the near-zero-variance predictors and use a repeated 3-fold cross-validation instead.

I hope you had as much fun as I did in dissecting a fascinating problem while unveiling a bit of the new `caret`. You can use this code as a template for exploring different recipes and models. If you encounter any problems or have any questions, do not hesitate to post a comment underneath – we all learn from each other!

[1] Keller, Andreas *et al*. (2017). Predicting human olfactory perception from chemical features of odor molecules. *Science*, 355(6327), 820-826.

[2] Yeo, In-Kwon and Johnson, Richard (2000). A new family of power transformations to improve normality or symmetry. *Biometrika*, 87(4), 954-959.


(This article was first published on **Revolutions**, and kindly contributed to R-bloggers)

Mango Solutions presented the EARL conference in Seattle this November and I was fortunate enough to have time to attend. During the conference I took notes on the presentations, which I’ll pass along to you.

The EARL conference started with a panel discussion on R. Moderated by Doug Ashton from Mango Solutions, the panel included Julia Silge with Stack Overflow, David Smith with Microsoft, and Joe Cheng with RStudio.

Topics ranged from the tidyverse, R certification, R as a general-purpose language, and Shiny to the future of R. I captured some interesting quotes from the panel members:

Regarding *R certification*, David Smith points out that certifications are important for big companies trying to filter employment applications. He mentions that certification is a minimum bar for some HR departments. Julia Silge mentions that Stack Overflow has found certifications to be less influential in the hiring process.

*R as a general purpose language*: Joe Cheng feels that R is useful for more than just statistics, but that RStudio isn’t interested in developing general-purpose tools. There was discussion around Python as the “second best” language for a lot of applications, and agreement that R should remain focused on data science.

Most interesting was the discussion regarding *the future of R*. Julia Silge points out that Stack Overflow data shows R growing fast year over year — at about the same rate as Python. There are a lot of new users, and packages need to take that into account.

Tim Oldfield introduces this conference as NOT a code-based conference. However, __Julia Silge__ doesn't hesitate to illustrate her discussion with appropriate code. And seriously, how would you discuss natural language processing without using the language of the trade?

I won't get into the terms (TF-IDF) and technology (tidytext) of Ms Silge's __presentation__. I will mention she does a great job of summarizing how and why to perform text mining. Like all good tech, you can easily scratch the surface of text mining in fifteen minutes. A thorough understanding requires years of hard research. If you'd like an introduction to her work, take a look at her paper __She Giggles, He Gallops – analyzing gender tropes in film__.

__David Smith__ with Microsoft presented a session on neural nets, machine learning and transfer learning. More than just a theoretical chat, David illustrated the concepts with working tools. I’ve read quite a bit about machine learning – but this illustration really brings it home. Oh — and it’s pretty damn funny. ( David posted this on a blog entry here )

__Eina Ooka__ found herself forced to incorporate Excel with her data workflow. Now — we all have opinions about Excel in data science — but Eina points out that for multidisciplinary data science teams, it’s good for data storage, modeling, and reports. There *are* issues about reproducibility and source control, and for that, R is a good solution. But Eina summarizes that Excel is still useful. Not all projects can move away from it.

__Stephanie Kirmer__ provided real-life experience and lessons learned on Data Science teams. Her themes included collaboration, version control, reproducibility, institutional knowledge, and other concerns. She has accomplished this with the consistent use of R packages.

One of her most interesting concepts was using packages to capture institutional knowledge. Documenting procedures with a function, contained in a standardized package provides stability and a common tool. This package then becomes an on-boarding tool for new hires and a knowledge repository for departing staff.

Risk is the chance that something bad will happen. __David Severski__ with Starbucks revealed some of the tools used by the coffee giant, specifically OpenFAIR (Open Factor Analysis of Information Risk) and EvaluatoR (an R package for risk evaluation). Dave points out that R is an elegant language for data tasks. It has an ecosystem of tools for simulations and statistics, making risk evaluation a plug-and-play process.

Personally, I don’t have call for risk evaluation. But it’s interesting to get a quick peek into the language and concerns of this specialty.

__Mark Gallivan__ reminds us about the *science* in *data science*. He researched the effect of Zika on travel by searching Twitter for #babymoon. With that data, he cross-referenced the geolocation of the tweets to draw conclusions about the impact of Zika on the travel plans of pregnant women. This is one of those useful presentations on the nuts and bolts of research.

On November 7th I attended the EARL (Enterprise Applications of the R Language) conference in Seattle. Two days later I attended Orycon, the annual science fiction convention in Portland, Oregon. After every conference I attend, I do a private post-mortem. I ask myself if the conference was worthwhile, if I’d attend again, and if my time was well-spent.

EARL is a deep dive into the application of the R language. Attendees are assumed to have deep understanding of R, statistics, and a domain knowledge; the quintessential definition of data science.

Orycon is a gathering of science fiction fans. It includes a crowd of cosplayers imitating the latest Star Wars characters — but I’ll ignore them for this discussion. To be clear, the people that appreciate science fiction are deeply involved in science fact.

As a result of attending EARL, I’m better prepared to appreciate the talent and experience flying slightly under the radar at Orycon. I already knew the methods the experts used to perform and document their research. Thanks to David Smith’s “not hotdog” talk, I understand machine learning at an operational level, so I can skip over that discussion, or correct bad science from a misinformed espouser of pseudo-fact.

*Mark is an author, educator, and writer and teaches about R and Raspberry Pi at LinkedIn Learning.*

To **leave a comment** for the author, please follow the link and comment on their blog: ** Revolutions**.


(This article was first published on ** Econometrics and Free Software**, and kindly contributed to R-bloggers)

In this blog post, I’ll use the data that I cleaned in a previous blog post, which you can download here. If you want to follow along, download the monthly data. In my last blog post I showed how to perform a grid search the “tidy” way and, as an example, looked for the right hyperparameters of a SARIMA model. However, the goal of that post was not hyperparameter optimization per se, so I did not bother tuning the hyperparameters on a validation set, and used the test set both to validate the hyperparameters and to test the forecast. Of course, this is not great, because it might lead to overfitting the hyperparameters to the test set. So in this blog post I split my data into training, validation and test sets and use a genetic algorithm to look for the hyperparameters. Again, this is not the best way to go about this problem, since the `{forecast}` package contains the very useful `auto.arima()` function. I just wanted to see what kind of solution a genetic algorithm would return, and also to try different cost functions.

If you’re interested, read on!

Let’s first load some libraries and define some helper functions (the helper functions were explained in previous blog posts):

```
library(tidyverse)
library(forecast)
library(rgenoud)
library(parallel)
library(lubridate)
library(furrr)
library(tsibble)
library(brotools)

ihs <- function(x){
  log(x + sqrt(x**2 + 1))
}

to_tibble <- function(forecast_object){
  point_estimate <- forecast_object$mean %>%
    as_tsibble() %>%
    rename(point_estimate = value,
           date = index)
  upper <- forecast_object$upper %>%
    as_tsibble() %>%
    spread(key, value) %>%
    rename(date = index,
           upper80 = `80%`,
           upper95 = `95%`)
  lower <- forecast_object$lower %>%
    as_tsibble() %>%
    spread(key, value) %>%
    rename(date = index,
           lower80 = `80%`,
           lower95 = `95%`)
  reduce(list(point_estimate, upper, lower), full_join)
}
```
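For the record, the `ihs()` helper is the inverse hyperbolic sine,

$$\operatorname{ihs}(x) = \log\left(x + \sqrt{x^2 + 1}\right) = \operatorname{asinh}(x),$$

a log-like transformation that, unlike the logarithm, is also defined at zero (base R even ships it as `asinh()`).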

Now, let’s load the data:

`avia_clean_monthly <- read_csv("https://raw.githubusercontent.com/b-rodrigues/avia_par_lu/master/avia_clean_monthy.csv")`

```
## Parsed with column specification:
## cols(
## destination = col_character(),
## date = col_date(format = ""),
## passengers = col_integer()
## )
```

Let’s split the data into a train set, a validation set and a test set:

```
avia_clean_train <- avia_clean_monthly %>%
  select(date, passengers) %>%
  filter(year(date) < 2013) %>%
  group_by(date) %>%
  summarise(total_passengers = sum(passengers)) %>%
  pull(total_passengers) %>%
  ts(., frequency = 12, start = c(2005, 1))

avia_clean_validation <- avia_clean_monthly %>%
  select(date, passengers) %>%
  filter(between(year(date), 2013, 2016)) %>%
  group_by(date) %>%
  summarise(total_passengers = sum(passengers)) %>%
  pull(total_passengers) %>%
  ts(., frequency = 12, start = c(2013, 1))

avia_clean_test <- avia_clean_monthly %>%
  select(date, passengers) %>%
  filter(year(date) >= 2016) %>%
  group_by(date) %>%
  summarise(total_passengers = sum(passengers)) %>%
  pull(total_passengers) %>%
  ts(., frequency = 12, start = c(2016, 1))

logged_test_data <- ihs(avia_clean_test)
logged_validation_data <- ihs(avia_clean_validation)
logged_train_data <- ihs(avia_clean_train)
```

I will train the models on data from 2005 to 2012, look for the hyperparameters on data from 2013 to 2016 and test the accuracy on data from 2016 to March 2018. For this kind of exercise, the ideal situation would be to perform cross-validation. Doing this with time-series data is not obvious, because of the autocorrelation between observations, which would be broken by the independent sampling that CV requires. Also, if for example you do leave-one-out CV, you would end up trying to predict a point in, say, 2017, with data from 2018, which does not make sense. So you should be careful about that. `{forecast}` is able to perform CV for time series, and `scikit-learn`, the Python package, is able to perform cross-validation of time series data too. I will not do it in this blog post and will simply focus on the genetic algorithm part.
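As an aside, rolling-origin evaluation with `{forecast}` can be sketched as follows (an illustrative sketch I did not run for this post; the `far1` forecast function is a hypothetical example):

```r
# Rolling-origin cross-validation with forecast::tsCV(): at each forecast
# origin the model is re-fit on the observations up to that point only,
# so no future data leaks into the training window.
far1 <- function(x, h) forecast::forecast(arima(x, order = c(1, 0, 0)), h = h)
cv_errors <- forecast::tsCV(logged_train_data, far1, h = 12)
sqrt(mean(cv_errors^2, na.rm = TRUE)) # rolling-origin RMSE
```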

Let’s start by defining the cost function to minimize. I’ll try several; in the first one I will minimize the RMSE:

```
cost_function_rmse <- function(param, train_data, validation_data, forecast_periods){
  order <- param[1:3]
  season <- c(param[4:6], 12)
  model <- purrr::possibly(arima, otherwise = NULL)(x = train_data, order = order,
                                                    seasonal = season,
                                                    method = "ML")
  if(is.null(model)){
    return(9999999)
  } else {
    forecast_model <- forecast::forecast(model, h = forecast_periods)
    point_forecast <- forecast_model$mean
    # RMSE: square the errors before averaging them
    sqrt(mean((point_forecast - validation_data)^2))
  }
}
```
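To be explicit about the target quantity, the RMSE over the $h$ validation points is

$$\text{RMSE} = \sqrt{\frac{1}{h} \sum_{t=1}^{h} \left(\hat{y}_t - y_t\right)^2}.$$

Note that the placement of the parentheses matters: each error must be squared before averaging, otherwise the expression collapses to the absolute value of the mean error.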

If `arima()` is not able to estimate a model for the given parameters, I force it to return `NULL`, and in that case the cost function returns a very high cost. If a model was successfully estimated, I compute the RMSE.

Let’s also take a look at what `auto.arima()` says:

```
starting_model <- auto.arima(logged_train_data)
summary(starting_model)
```

```
## Series: logged_train_data
## ARIMA(1,0,2)(2,1,0)[12]
##
## Coefficients:
## ar1 ma1 ma2 sar1 sar2
## 0.9754 -0.7872 0.2091 -0.7285 -0.4413
## s.e. 0.0261 0.1228 0.1213 0.1063 0.1150
##
## sigma^2 estimated as 0.004514: log likelihood=105.61
## AIC=-199.22 AICc=-198.13 BIC=-184.64
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set 0.008398036 0.06095102 0.03882593 0.07009285 0.3339574
## MASE ACF1
## Training set 0.4425794 0.02073886
```

Let’s compute the cost at this vector of parameters:

```
cost_function_rmse(c(1, 0, 2, 2, 1, 0),
train_data = logged_train_data,
validation_data = logged_validation_data,
forecast_periods = 65)
```

`## [1] 0.1731473`

Ok, now let’s start optimizing the hyperparameters. Let’s help the genetic algorithm a little bit by defining where it should perform the search:

`domains <- matrix(c(0, 3, 0, 2, 0, 3, 0, 3, 0, 2, 0, 3), byrow = TRUE, ncol = 2)`

This matrix constrains the first parameter to lie between 0 and 3, the second one between 0 and 2, and so on.

Let’s call the `genoud()` function from the `{rgenoud}` package, and use 8 cores:

```
cl <- makePSOCKcluster(8)
clusterExport(cl, c('logged_train_data', 'logged_validation_data'))

tic <- Sys.time()
auto_arima_rmse <- genoud(cost_function_rmse,
                          nvars = 6,
                          data.type.int = TRUE,
                          starting.values = c(1, 0, 2, 2, 1, 0), # <- from auto.arima
                          Domains = domains,
                          cluster = cl,
                          train_data = logged_train_data,
                          validation_data = logged_validation_data,
                          forecast_periods = length(logged_validation_data),
                          hard.generation.limit = TRUE)
toc_rmse <- Sys.time() - tic
```

`makePSOCKcluster()` is a function from the `{parallel}` package. I must also *export* the global variables `logged_train_data` and `logged_validation_data`. If I don’t do that, the workers called by `genoud()` will not *know* about these variables and an error will be returned. The option `data.type.int = TRUE` forces the algorithm to look only for integers, and `hard.generation.limit = TRUE` forces the algorithm to stop after 100 generations.

The process took 7 minutes, which is faster than doing the grid search.

What was the solution found?

`auto_arima_rmse`

```
## $value
## [1] 0.0001863039
##
## $par
## [1] 3 2 1 1 2 1
##
## $gradients
## [1] NA NA NA NA NA NA
##
## $generations
## [1] 11
##
## $peakgeneration
## [1] 1
##
## $popsize
## [1] 1000
##
## $operators
## [1] 122 125 125 125 125 126 125 126 0
```

Let’s train the model using the `arima()` function at these parameters:

```
best_model_rmse <- arima(logged_train_data, order = auto_arima_rmse$par[1:3],
                         seasonal = list(order = auto_arima_rmse$par[4:6], period = 12),
                         method = "ML")
summary(best_model_rmse)
```

```
##
## Call:
## arima(x = logged_train_data, order = auto_arima_rmse$par[1:3], seasonal = list(order = auto_arima_rmse$par[4:6],
## period = 12), method = "ML")
##
## Coefficients:
## ar1 ar2 ar3 ma1 sar1 sma1
## -0.6999 -0.4541 -0.0476 -0.9454 -0.4996 -0.9846
## s.e. 0.1421 0.1612 0.1405 0.1554 0.1140 0.2193
##
## sigma^2 estimated as 0.006247: log likelihood = 57.34, aic = -100.67
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set -0.0006142355 0.06759545 0.04198561 -0.005408262 0.3600483
## MASE ACF1
## Training set 0.4386693 -0.008298546
```

Let’s extract the forecasts:

```
best_model_rmse_forecast <- forecast::forecast(best_model_rmse, h = 65)
best_model_rmse_forecast <- to_tibble(best_model_rmse_forecast)
```

```
## Joining, by = "date"
## Joining, by = "date"
```

```
starting_model_forecast <- forecast(starting_model, h = 65)
starting_model_forecast <- to_tibble(starting_model_forecast)
```

```
## Joining, by = "date"
## Joining, by = "date"
```

and plot the forecast to see how it looks:

```
avia_clean_monthly %>%
group_by(date) %>%
summarise(total = sum(passengers)) %>%
mutate(total_ihs = ihs(total)) %>%
ggplot() +
ggtitle("Minimization of RMSE") +
geom_line(aes(y = total_ihs, x = date), colour = "#82518c") +
scale_x_date(date_breaks = "1 year", date_labels = "%m-%Y") +
geom_ribbon(data = best_model_rmse_forecast, aes(x = date, ymin = lower95, ymax = upper95),
fill = "#666018", alpha = 0.2) +
geom_line(data = best_model_rmse_forecast, aes(x = date, y = point_estimate),
linetype = 2, colour = "#8e9d98") +
geom_ribbon(data = starting_model_forecast, aes(x = date, ymin = lower95, ymax = upper95),
fill = "#98431e", alpha = 0.2) +
geom_line(data = starting_model_forecast, aes(x = date, y = point_estimate),
linetype = 2, colour = "#a53031") +
theme_blog()
```

The yellowish line and confidence intervals come from the genetic algorithm, and the reddish ones from `auto.arima()`. Interesting; the point estimate is very precise, but the confidence intervals are very wide. Low bias, high variance.

Now, let’s try another cost function, where I minimize the BIC, similar to what the `auto.arima()` function does:

```
cost_function_bic <- function(param, train_data, validation_data, forecast_periods){
  order <- param[1:3]
  season <- c(param[4:6], 12)
  model <- purrr::possibly(arima, otherwise = NULL)(x = train_data, order = order,
                                                    seasonal = season,
                                                    method = "ML")
  if(is.null(model)){
    return(9999999)
  } else {
    BIC(model)
  }
}
```
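For reference, the Bayesian information criterion being minimized here is

$$\text{BIC} = k \ln(n) - 2 \ln(\hat{L}),$$

where $k$ is the number of estimated parameters, $n$ the number of observations and $\hat{L}$ the maximized likelihood. Because the penalty grows with $\ln(n)$, BIC favors more parsimonious models than AIC does.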

Let’s take a look at the cost at the parameter values returned by `auto.arima()`:

```
cost_function_bic(c(1, 0, 2, 2, 1, 0),
train_data = logged_train_data,
validation_data = logged_validation_data,
forecast_periods = 65)
```

`## [1] -184.6397`

Let the genetic algorithm run again:

```
cl <- makePSOCKcluster(8)
clusterExport(cl, c('logged_train_data', 'logged_validation_data'))

tic <- Sys.time()
auto_arima_bic <- genoud(cost_function_bic,
                         nvars = 6,
                         data.type.int = TRUE,
                         starting.values = c(1, 0, 2, 2, 1, 0), # <- from auto.arima
                         Domains = domains,
                         cluster = cl,
                         train_data = logged_train_data,
                         validation_data = logged_validation_data,
                         forecast_periods = length(logged_validation_data),
                         hard.generation.limit = TRUE)
toc_bic <- Sys.time() - tic
```

This time it took 6 minutes, a bit faster than before. Let’s take a look at the solution:

`auto_arima_bic`

```
## $value
## [1] -201.0656
##
## $par
## [1] 0 1 1 1 0 1
##
## $gradients
## [1] NA NA NA NA NA NA
##
## $generations
## [1] 12
##
## $peakgeneration
## [1] 1
##
## $popsize
## [1] 1000
##
## $operators
## [1] 122 125 125 125 125 126 125 126 0
```

Let’s train the model at these parameters:

```
best_model_bic <- arima(logged_train_data, order = auto_arima_bic$par[1:3],
                        seasonal = list(order = auto_arima_bic$par[4:6], period = 12),
                        method = "ML")
summary(best_model_bic)
```

```
##
## Call:
## arima(x = logged_train_data, order = auto_arima_bic$par[1:3], seasonal = list(order = auto_arima_bic$par[4:6],
## period = 12), method = "ML")
##
## Coefficients:
## ma1 sar1 sma1
## -0.6225 0.9968 -0.832
## s.e. 0.0835 0.0075 0.187
##
## sigma^2 estimated as 0.004145: log likelihood = 109.64, aic = -211.28
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set 0.003710982 0.06405303 0.04358164 0.02873561 0.3753513
## MASE ACF1
## Training set 0.4553447 -0.03450603
```

And let’s plot the results:

```
best_model_bic_forecast <- forecast::forecast(best_model_bic, h = 65)
best_model_bic_forecast <- to_tibble(best_model_bic_forecast)
```

```
## Joining, by = "date"
## Joining, by = "date"
```

```
avia_clean_monthly %>%
group_by(date) %>%
summarise(total = sum(passengers)) %>%
mutate(total_ihs = ihs(total)) %>%
ggplot() +
ggtitle("Minimization of BIC") +
geom_line(aes(y = total_ihs, x = date), colour = "#82518c") +
scale_x_date(date_breaks = "1 year", date_labels = "%m-%Y") +
geom_ribbon(data = best_model_bic_forecast, aes(x = date, ymin = lower95, ymax = upper95),
fill = "#5160a0", alpha = 0.2) +
geom_line(data = best_model_bic_forecast, aes(x = date, y = point_estimate),
linetype = 2, colour = "#208480") +
geom_ribbon(data = starting_model_forecast, aes(x = date, ymin = lower95, ymax = upper95),
fill = "#98431e", alpha = 0.2) +
geom_line(data = starting_model_forecast, aes(x = date, y = point_estimate),
linetype = 2, colour = "#a53031") +
theme_blog()
```

The solutions are very close, both in terms of point estimates and confidence intervals. Bias increased, but variance decreased… This gives me an idea! What if I minimize the RMSE while keeping the number of parameters low, as a kind of regularization? This is somewhat what minimizing the BIC does, but let’s try a more “naive” approach:

```
cost_function_rmse_low_k <- function(param, train_data, validation_data, forecast_periods, max.order){
  order <- param[1:3]
  season <- c(param[4:6], 12)
  # Reject candidates whose total order p + q + P + Q exceeds max.order
  if(param[1] + param[3] + param[4] + param[6] > max.order){
    return(9999999)
  } else {
    model <- purrr::possibly(arima, otherwise = NULL)(x = train_data,
                                                      order = order,
                                                      seasonal = season,
                                                      method = "ML")
  }
  if(is.null(model)){
    return(9999999)
  } else {
    forecast_model <- forecast::forecast(model, h = forecast_periods)
    point_forecast <- forecast_model$mean
    # RMSE: square the errors before averaging them
    sqrt(mean((point_forecast - validation_data)^2))
  }
}
```

This is also similar to what `auto.arima()` does; by default, the `max.order` argument of `auto.arima()` is set to 5, and is the sum of `p + q + P + Q`. So I’ll try something similar.

Let’s take a look at the cost at the parameter values returned by `auto.arima()`:

```
cost_function_rmse_low_k(c(1, 0, 2, 2, 1, 0),
train_data = logged_train_data,
validation_data = logged_validation_data,
forecast_periods = 65,
max.order = 5)
```

`## [1] 0.1731473`

Let’s see what will happen:

```
cl <- makePSOCKcluster(8)
clusterExport(cl, c('logged_train_data', 'logged_validation_data'))

tic <- Sys.time()
auto_arima_rmse_low_k <- genoud(cost_function_rmse_low_k,
                                nvars = 6,
                                data.type.int = TRUE,
                                starting.values = c(1, 0, 2, 2, 1, 0), # <- from auto.arima
                                max.order = 5,
                                Domains = domains,
                                cluster = cl,
                                train_data = logged_train_data,
                                validation_data = logged_validation_data,
                                forecast_periods = length(logged_validation_data),
                                hard.generation.limit = TRUE)
toc_rmse_low_k <- Sys.time() - tic
```

It took 1 minute to train this one, quite fast! Let’s take a look:

`auto_arima_rmse_low_k`

```
## $value
## [1] 0.002503478
##
## $par
## [1] 1 2 0 3 1 0
##
## $gradients
## [1] NA NA NA NA NA NA
##
## $generations
## [1] 11
##
## $peakgeneration
## [1] 1
##
## $popsize
## [1] 1000
##
## $operators
## [1] 122 125 125 125 125 126 125 126 0
```

And let’s plot it:

```
best_model_rmse_low_k <- arima(logged_train_data, order = auto_arima_rmse_low_k$par[1:3],
                               seasonal = list(order = auto_arima_rmse_low_k$par[4:6], period = 12),
                               method = "ML")
summary(best_model_rmse_low_k)
```

```
##
## Call:
## arima(x = logged_train_data, order = auto_arima_rmse_low_k$par[1:3], seasonal = list(order = auto_arima_rmse_low_k$par[4:6],
## period = 12), method = "ML")
##
## Coefficients:
## ar1 sar1 sar2 sar3
## -0.6468 -0.7478 -0.5263 -0.1143
## s.e. 0.0846 0.1171 0.1473 0.1446
##
## sigma^2 estimated as 0.01186: log likelihood = 57.88, aic = -105.76
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set 0.0005953302 0.1006917 0.06165919 0.003720452 0.5291736
## MASE ACF1
## Training set 0.6442205 -0.3706693
```

```
best_model_rmse_low_k_forecast <- forecast::forecast(best_model_rmse_low_k, h = 65)
best_model_rmse_low_k_forecast <- to_tibble(best_model_rmse_low_k_forecast)
```

```
## Joining, by = "date"
## Joining, by = "date"
```

```
avia_clean_monthly %>%
group_by(date) %>%
summarise(total = sum(passengers)) %>%
mutate(total_ihs = ihs(total)) %>%
ggplot() +
ggtitle("Minimization of RMSE + low k") +
geom_line(aes(y = total_ihs, x = date), colour = "#82518c") +
scale_x_date(date_breaks = "1 year", date_labels = "%m-%Y") +
geom_ribbon(data = best_model_rmse_low_k_forecast, aes(x = date, ymin = lower95, ymax = upper95),
fill = "#5160a0", alpha = 0.2) +
geom_line(data = best_model_rmse_low_k_forecast, aes(x = date, y = point_estimate),
linetype = 2, colour = "#208480") +
geom_ribbon(data = starting_model_forecast, aes(x = date, ymin = lower95, ymax = upper95),
fill = "#98431e", alpha = 0.2) +
geom_line(data = starting_model_forecast, aes(x = date, y = point_estimate),
linetype = 2, colour = "#a53031") +
theme_blog()
```

Looks like this was not the right strategy. There might be a better cost function than the ones I have tried, but it looks like minimizing the BIC is the way to go.

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates or buy me an espresso.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Econometrics and Free Software**.


(This article was first published on ** Revolutions**, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from October of particular interest to R users.

Peter Provost ports some ’80s-era BASIC programs for kids to R.

In a podcast for Fringe FM, I discuss the ethics of AI, Microsoft and Open Source, and the R Community.

Roundup of AI, Machine Learning and Data Science news from October 2018.

In this episode of "Guy in a Cube", R is used to visualize Anscombe's Quartet via Power BI.

Di Cook suggests using computer vision to automate statistical model assessment for machine learning in the 2018 Belz Lecture.

R provides the analysis behind a front-page story on bridge safety in the Baltimore Sun.

Tomas Kalibera describes the big impacts of a small tweak to the logical comparison operators in R.

The Economist is now using R to calculate its famous "Big Mac Index".

Behind-the-scenes details of how R gets built on Windows, from a presentation by Jeroen Ooms.

The R Consortium has accepted another round of grant applications for R community projects.

A list of upcoming R conferences.

A recap of AI, Machine Learning and Data Science announcements from the Microsoft Ignite conference.

And some general interest stories (not necessarily related to R):

- A lesson on diversity: the Parable of the Polygons
- The story behind the baseball scene in The Naked Gun
- Public Key Cryptography, as explained by IKEA

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Revolutions**.
