(This article was first published on **Revolutions**, and kindly contributed to R-bloggers)

The future package is a powerful and elegant cross-platform framework for orchestrating asynchronous computations in R. It's ideal for working with computations that take a long time to complete; that would benefit from using distributed, parallel frameworks to make them complete faster; and that you'd rather not have locking up your interactive R session. You can get a good sense of the future package from its introductory vignette or from this eRum 2018 presentation by its author, Henrik Bengtsson (video embedded below), but at its simplest it allows constructs in R like this:

```
a %<-% slow_calculation(1:50)
b %<-% slow_calculation(51:100)
a + b
```

The idea here is that `slow_calculation` is an R function that takes a lot of time, but with the special `%<-%` assignment operator the computation begins *and the R interpreter is ready immediately*. The first two lines of R code above take essentially zero time to execute. The future package farms off those computations to another process or even a remote system (you specify which with a preceding `plan` call), and R will only halt when the *result* is needed, as in the third line above. This is beneficial in Bengtsson's own work, where he uses the future package to parallelize cancer research on DNA sequences on high-performance computing (HPC) clusters.

The future package supports a wide variety of computation frameworks including parallel local R sessions, remote R sessions, and cluster computing frameworks. (If you can't use any of these, it falls back to evaluating the expressions locally, in sequence.) The future package also works in concert with other parallel programming systems already available in R. For example, it provides `future_lapply` (from the companion future.apply package) as a futurized analog of `lapply`, which will use whatever computation plan you have defined to run the computations in parallel.
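A minimal sketch of that pattern, assuming the future and future.apply packages are installed (`slow_square` is just an illustrative stand-in for a slow task):

```r
library(future)
library(future.apply)

# Run the computations in two background R sessions;
# switch to plan(sequential) and the same code runs serially.
plan(multisession, workers = 2)

slow_square <- function(x) { Sys.sleep(0.1); x^2 }  # stand-in for a slow task
res <- future_lapply(1:4, slow_square)
unlist(res)

plan(sequential)  # restore sequential evaluation
```

Because `future_lapply` respects whatever `plan` is in effect, the same call works unchanged on a local machine or an HPC scheduler.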

The future package also extends the foreach package thanks to the updated doFuture package. By using `registerDoFuture` as the foreach backend, your loops can use any computation plan provided by the future package to run the iterations in parallel. (The same applies to R packages that use foreach internally, notably the caret package.) This means you can now use foreach with any of the HPC schedulers supported by future, including TORQUE, Slurm, and OpenLava. So if you share a Slurm HPC cluster with colleagues in your department, you can queue up a parallel simulation on the cluster using code like this:

```
library("doFuture")
registerDoFuture()
library("future.batchtools")
plan(batchtools_slurm)

mu <- 1.0
sigma <- 2.0
x <- foreach(i = 1:3, .export = c("mu", "sigma")) %dopar% {
  rnorm(i, mean = mu, sd = sigma)
}
```

The future package is available on CRAN now, and works consistently on Windows, Mac and Linux systems. You can learn more in the video at the end of this post, or in the recent blog update linked below.

To **leave a comment** for the author, please follow the link and comment on their blog: **Revolutions**.

(This article was first published on **R - Data Science Heroes Blog**, and kindly contributed to R-bloggers)

This is a post about feature selection using genetic algorithms in R, in which we will do a quick review about:

- What are genetic algorithms?
- GA in ML?
- What does a solution look like?
- GA process and its operators
- The fitness function
- Genetic Algorithms in R!
- Try it yourself
- Relating concepts

*Animation source: "Flexible Muscle-Based Locomotion for Bipedal Creatures" – Thomas Geijtenbeek*

Imagine a black box which can help us to decide over an **unlimited number of possibilities**, with a criterion such that we can find an acceptable solution (both in time and quality) to a problem that we formulate.

Genetic Algorithms (GA) are a mathematical model inspired by Charles Darwin's famous idea of *natural selection*.

Natural selection preserves only the fittest individuals over the generations.

Imagine a population of 100 rabbits in 1900; if we look at the population today, we will find rabbits that are faster and more skillful at finding food than their ancestors.

In **machine learning**, one of the uses of genetic algorithms is to pick the right subset of variables with which to create a predictive model.

Picking the right subset of variables is a problem of **combinatorial optimization**.

The advantage of this technique over others is that it allows the best solution to emerge from the best prior solutions: an evolutionary algorithm that improves the selection over time.

The idea of GA is to combine the different solutions **generation after generation** to extract the best *genes* (variables) from each one. That way it creates new and more fitted individuals.

We can find other uses of GA, such as hyperparameter tuning, finding the maximum (or minimum) of a function, or searching for the right neural network architecture (neuroevolution), among others.

Every possible solution of the GA, i.e. a set of selected variables, is **considered as a whole**: the algorithm does not rank variables individually against the target.

And this is important because we already know that variables work in groups.

Keeping it simple for the example, imagine we have a total of 6 variables. One solution can be picking up 3 variables, let's say: `var2`, `var4` and `var5`. Another solution can be: `var1` and `var5`.

These solutions are the so-called **individuals** or **chromosomes** in a population. They are possible solutions to our problem.

*Credit image: Vijini Mallawaarachchi*

From the image, solution 3 can be expressed as a one-hot vector: `c(1,0,1,0,1,1)`. Each `1` indicates that the solution contains that variable, in this case: `var1`, `var3`, `var5`, `var6`. While solution 4 is: `c(1,1,0,1,1,0)`.

Each position in the vector is a **gene**.

The underlying idea of a GA is to generate some random possible solutions (called the `population`), which represent different variable subsets, and then combine the best solutions in an iterative process.

This combination follows the basic GA operations, which are: selection, mutation and cross-over.

- **Selection**: Pick the most fitted individuals in a generation (i.e. the solutions providing the highest ROC).
- **Cross-over**: Create 2 new individuals, based on the genes of two solutions. These children will appear in the next generation.
- **Mutation**: Change a gene randomly in the individual (i.e. flip a `0` to a `1`).

The idea is that in each generation we will find better individuals, like a faster rabbit.

I recommend the post of Vijini Mallawaarachchi about how a genetic algorithm works.

These basic operations allow the algorithm to change the possible solutions by combining them in a way that maximizes the objective.

This objective maximization is, for example, to keep with the solution that maximizes the area under the ROC curve. This is defined in the *fitness function*.

The fitness function takes a possible solution (or chromosome, if you want to sound more sophisticated), and *somehow* evaluates the effectiveness of the selection.

Normally, the fitness function takes the one-hot vector `c(1,1,0,0,0,0)`, creates, for example, a random forest model with `var1` and `var2`, and returns the fitness value (ROC).

The fitness value in this code is calculated as `ROC value / number of variables`. By doing this, the algorithm penalizes solutions with a large number of variables, similar in spirit to the Akaike information criterion (AIC).

My intention is to provide you with a clean code so you can understand what’s behind, while at the same time, try new approaches like modifying the fitness function. This is a crucial point.

To use it on your own data set, make sure `data_x` (data frame) and `data_y` (factor) are compatible with the `custom_fitness` function. The main library is `GA`. See here the vignette with examples.

**Important**: The following code is incomplete. **Clone the repository** to run the example.

```
# data_x: input data frame
# data_y: target variable (factor)

# GA parameters
param_nBits = ncol(data_x)
col_names = colnames(data_x)

# Executing the GA
ga_GA_1 = ga(fitness = function(vars) custom_fitness(vars = vars,
                                                     data_x = data_x,
                                                     data_y = data_y,
                                                     p_sampling = 0.7), # custom fitness function
             type = "binary",              # optimization data type
             crossover = gabin_uCrossover, # cross-over method
             elitism = 0.1,                # elitism prob
             pmutation = 0.03,             # mutation rate prob
             popSize = 50,                 # the number of individuals/solutions
             nBits = param_nBits,          # total number of variables
             names = col_names,            # variable names
             run = 5,                      # max iter without improvement (stopping criterion)
             maxiter = 50,                 # total runs or generations
             monitor = plot,               # plot the result at each iteration
             keepBest = TRUE,              # keep the best solution at the end
             parallel = TRUE,              # allow parallel processing
             seed = 84211                  # for reproducibility purposes
)
```

```
# Checking the results
summary(ga_GA_1)
```

```
── Genetic Algorithm ───────────────────
GA settings:
Type = binary
Population size = 50
Number of generations = 50
Elitism = 0.1
Crossover probability = 0.8
Mutation probability = 0.03
GA results:
Iterations = 15
Fitness function value = 0.1648982
Solution =
radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean
[1,] 0 0 0 0 0 1
concavity_mean concave points_mean symmetry_mean fractal_dimension_mean ...
[1,] 0 1 0 0
symmetry_worst fractal_dimension_worst
[1,] 0 0
```

```
# Following line will return the variable names of the final and best solution
best_vars_ga=col_names[ga_GA_1@solution[1,]==1]
# Checking the variables of the best solution...
best_vars_ga
```

```
[1] "compactness_mean" "concave points_mean" "symmetry_se" "radius_worst"
[5] "compactness_worst" "concavity_worst"
```

- Blue dot: Population fitness average
- Green dot: Best fitness value

Note: Don’t expect the result that fast

Now we calculate the accuracy based on the best selection!

```
get_accuracy_metric(data_tr_sample = data_x, target = data_y, best_vars_ga)
```

```
[1] 0.9402818
```

The accuracy is around 94%, while the ROC value is close to 0.95 (ROC = fitness value * number of variables; check the fitness function).

I don't like to analyze accuracy without the cutpoint (Scoring Data), but it's useful to compare with the results of this Kaggle post, whose author got a similar accuracy result using recursive feature elimination, or RFE.

Try a new fitness function: some solutions still include a large number of variables, so you can try squaring the number of variables in the penalty.

Another thing to try is changing the algorithm used to get the ROC value, or even changing the metric.

Some configurations take a long time to run. Balance the classes before modeling and play with the `p_sampling` parameter. Sampling techniques can have a big impact on models; check the Sample size and class balance on model performance post for more info.

How about changing the rate of mutation or elitism? Or trying other cross-over methods?

Increase the `popSize` to test more possible solutions at the same time (at a time cost).

Feel free to share any insights or ideas to improve the selection.

**Clone the repository** to run the example.

There is a parallel between GA and deep learning: the concept of iteration and improvement over time is similar.

I added the `p_sampling` parameter to speed things up, and it usually accomplishes its goal. It is similar to the *batch* concept used in deep learning. Another parallel is between the GA parameter `run` and the *early stopping* criterion in neural network training.

But the biggest similarity is that both techniques come from **observing nature**. In both cases, humans observed how neurons and genetics work and created simplified mathematical models that imitate their behavior. Nature has had millions of years of evolution, so why not try to imitate it?

—

I tried to be brief about GA, but if you have any specific question on this vast topic, please leave it in the comments.

*And, if I didn't motivate you enough to study GA, check this project, which is based on neuroevolution:*

Thanks for reading

Find me on Twitter and Linkedin.

More blog posts.

Want to learn more? Data Science Live Book


(This article was first published on **R – intobioinformatics**, and kindly contributed to R-bloggers)

Clusterlab is a CRAN package (https://cran.r-project.org/web/packages/clusterlab/index.html) for the routine testing of clustering algorithms. It can simulate positive controls (data-sets with >1 clusters) and negative controls (data-sets with 1 cluster). Why test clustering algorithms? Because they often fail to identify the true K in practice, published algorithms are not always well tested, and we need to know about the ones with strange behaviour. I've found in many of my own experiments that the clustering algorithms many people are using are not necessarily the ones that provide the most sensible results. I can give a good example below.

I was interested to see that clusterExperiment, a relatively new clustering package on Bioconductor, could not detect the ground-truth K in my testing so far, instead yielding solutions with many more clusters than there are in reality. On the other hand, the package I developed with David Watson here at QMUL does work rather well. It is a derivative of the Monti et al. (2003) consensus clustering algorithm, fitted with a Monte Carlo reference procedure to eliminate overfitting. We called this algorithm M3C.

```
library(clusterExperiment)
library(clusterlab)
library(M3C)
```

**Experiment 1: Simulate a positive control dataset (K=5)**

```
k5 <- clusterlab(centers = 5)
```

**Experiment 2: Test RSEC (https://bioconductor.org/packages/release/bioc/html/clusterExperiment.html)**

```
rsec_test <- RSEC(as.matrix(k5$synthetic_data), ncores = 10)
assignments <- primaryCluster(rsec_test)
pca(k5$synthetic_data, labels = assignments)
```

**Experiment 3: Test M3C (https://bioconductor.org/packages/release/bioc/html/M3C.html)**

```
M3C_test <- M3C(k5$synthetic_data, iters = 10, cores = 10)
optk <- which.max(M3C_test$scores$RCSI) + 1
pca(M3C_test$realdataresults[[optk]]$ordered_data,
    labels = M3C_test$realdataresults[[optk]]$ordered_annotation$consensuscluster,
    printres = TRUE)
```

Interesting, isn't it, R readers?

Well, all I can say is that I recommend comparing different machine learning methods, and if in doubt, running your own control data-sets (positive and negative controls) to test various methods. In the other post we showed a remarkable bias in optimism-corrected bootstrapping compared with LOOCV under certain conditions, simply by simulating null data-sets and passing them into the method.

**If you are clustering 'omics' data from a single platform (RNA-seq, protein, methylation, etc.) I have tested and recommend the following packages:**

CLEST: https://www.rdocumentation.org/packages/RSKC/versions/2.4.2/topics/Clest

M3C: https://bioconductor.org/packages/release/bioc/html/M3C.html

And to be honest, that is about it. I have also tested PINSplus, GAP-statistic, and SNF, but they did not provide satisfactory results in my experiments on single-platform clustering (currently unpublished data). Multi-omic and single-cell RNA-seq analysis is another story; there will be more on that to come in the future, R readers.

Remember there is a darkness out there R readers, not just in Washington, but in the world of statistics. It is there because of the positive results bias in science, because of people not checking methods and comparing them with one another, and because of several other reasons I can’t even be bothered to mention.


(This article was first published on **R – Longhow Lam's Blog**, and kindly contributed to R-bloggers)

At the beginning of the new year I always want to clean up my photos on my phone. It just never happens.

So now (like so many others I think) I have a lot of photos on my phone from the last 3.5 years. The iPhone photos app helps you a bit to go through your photos. But which ones are really special and you definitely want to keep?

Well, just apply some machine learning.

- I run all my photos through a VGG16 deep learning model to generate high-dimensional features per photo (on my laptop without a GPU this takes about 15 minutes for 2,000 photos).
- The dimension is 25,088, which is difficult to visualize, so I apply a UMAP dimension reduction to bring it back to 3.
- In R you can create an interactive 3D plot with plotly where each point corresponds to a photo. Using crosstalk, you can link it to the corresponding image: the photo appears when you hover over the point.
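The reduction step of the pipeline above might look roughly like this. This is a sketch assuming the umap package; `features` is a random stand-in for the real VGG16 feature matrix, and the plotly/crosstalk hover linking lives in the repo scripts:

```r
library(umap)

set.seed(1)
# Stand-in for the real 2000 x 25088 VGG16 feature matrix
features <- matrix(rnorm(200 * 50), nrow = 200)

# Reduce to 3 dimensions for an interactive 3D scatter plot
emb <- umap(features, n_components = 3)
coords <- emb$layout   # one 3D coordinate per photo
```

Each row of `coords` then becomes one hoverable point in the plotly scatter plot.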

Well, one “special” outlying picture in the scatter plot is of my two children with a dog during a holiday a few years ago; I would never have found it that fast by scrolling. There are some other notable things that I can see, but I won't bother you with them here.

Link to the GitHub repo with the two R scripts to cluster your own photos. Cheers, Longhow


(This article was first published on **RBlog – Mango Solutions**, and kindly contributed to R-bloggers)

As leading advanced analytics partner for RStudio, Mango Solutions are delighted to be contributing to the upcoming rstudio::conf programme with a workshop and a talk.

Two of Mango’s senior consultants, Aimée Gott, Education Practice Lead and Mark Sellors, Head of Data Engineering will be sharing their R expertise with delegates.

Aimée Gott will be delivering the intermediate course on Shiny to the RStudio delegates. Having delivered training the world over, Mango has successfully run R and Python training courses, upskilling data science teams and giving them the essential skills to deliver data-driven insight. Mark Sellors, Mango's Head of Data Engineering, will be delivering a presentation on 'R in Production' as part of the conference programme.

Mark Sellors has years of expertise in Data Engineering and works closely with RStudio to install and configure RStudio’s commercial products. Mango’s relationship with RStudio forms part of their professional services partnership network in data science with R, easing the development environment and ensuring an effective adoption by Data Science teams.

The strategic partnership between Mango and RStudio is based on supporting the functionality and use of the R language and has been in existence for over 10 years. Matt Aldridge, Mango Solutions CEO said of the relationship, ‘it has been pivotal in proving the effectiveness of R as a corporate analytic tool and we are proud at Mango to work alongside their talented and committed team ‘.

Dave Hurst, Director of Business Development at RStudio said ‘We are pleased to have both Aimée and Mark contributing at rstudio::conf (2019). Mango Solutions is an RStudio Full Service Certified Partner, so we know first-hand the depth of expertise and practical experience that Mango offers to our mutual customers who are looking to successfully scale R with RStudio products in production.‘

Mark's talk on R in production will focus on the increasing trend of putting data science models and other analytic code into production, in particular code written in R. Mark will look at some of the misinformation around what 'putting something into production' actually means, as well as provide tips on overcoming the obstacles.

Shiny is an open-source R package that provides an elegant and powerful web framework for building web applications using R. The Shiny training course that is part of the conference was designed by Shiny author Joe Cheng and is aimed at the experienced Shiny developer. The workshop will improve delegates' understanding of Shiny's foundations and help them make the most of reactive programming techniques, extend and improve their UI, debug applications, and use tools for modularising them. By the end of the two days, delegates will be able to push the envelope of what data scientists and organisations can do with Shiny.

Together, RStudio and Mango work with some of the world’s largest companies and are committed to supporting the growth and adoption of R for advanced analytics.

We hope to see you there!


(This article was first published on **bnosac :: open analytical helpers**, and kindly contributed to R-bloggers)

Last week the R package ruimtehol was released on CRAN (https://github.com/bnosac/ruimtehol) allowing R users to easily build and apply neural embedding models on text data.

It wraps the 'StarSpace' library (https://github.com/facebookresearch/StarSpace), allowing users to calculate **word, sentence, article, document, webpage, link and entity 'embeddings'. By using the 'embeddings', you can perform text-based multi-label classification, find similarities between texts and categories, do collaborative-filtering-based recommendation as well as content-based recommendation, find out relations between entities, calculate graph 'embeddings' as well as perform semi-supervised learning and multi-task learning on plain text**. The techniques are explained in detail in the paper 'StarSpace: Embed All The Things!' by Wu et al. (2017), available at https://arxiv.org/abs/1709.03856.

You can get started with some common text analytical use cases by using the presentation we have built below. Enjoy!

(Presentation: images/bnosac/blog/R_TextMining_Starspace.pdf)

If you like it, give it a star at https://github.com/bnosac/ruimtehol and if you need commercial support on text mining, get in touch.

Note also that you might be interested in the following courses held in Belgium

- 21-22/02/2018: Advanced R programming. Leuven (Belgium). Subscribe here
- 13-14/03/2018: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
- 15/03/2019: Image Recognition with R and Python. Subscribe here
- **01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here**


(This article was first published on **R-Bloggers – Learning Machines**, and kindly contributed to R-bloggers)

Everything “neural” is (again) the latest craze in machine learning and artificial intelligence. Now what is the magic here?

Let us dive directly into an (admittedly a little silly) example: we have three protagonists in the fairy tale Little Red Riding Hood: the wolf, the grandmother and the woodcutter. They all have certain qualities, and Little Red Riding Hood reacts in certain ways towards them. For example, the grandmother has big eyes, is kindly and wrinkled, so Little Red Riding Hood will approach her, kiss her on the cheek and offer her food (the behaviour "flirt with" towards the woodcutter is a little sexist, but we kept it to reproduce the original example from Jones, W. & Hoskins, J.: Back-Propagation, Byte, 1987). We will build and train a neural network which gets the qualities as inputs and Little Red Riding Hood's behaviour as output, i.e. we train it to learn the adequate behaviour for each quality.

Have a look at the following code and its output including the resulting plot:

```
library(neuralnet)
library(NeuralNetTools)

# code qualities and actions
qualities <- matrix(c(1, 1, 1, 0, 0, 0,
                      0, 1, 0, 1, 1, 0,
                      1, 0, 0, 1, 0, 1), byrow = TRUE, nrow = 3)
colnames(qualities) <- c("big_ears", "big_eyes", "big_teeth", "kindly", "wrinkled", "handsome")
rownames(qualities) <- c("wolf", "grannie", "woodcutter")
qualities
##            big_ears big_eyes big_teeth kindly wrinkled handsome
## wolf              1        1         1      0        0        0
## grannie           0        1         0      1        1        0
## woodcutter        1        0         0      1        0        1

actions <- matrix(c(1, 1, 1, 0, 0, 0, 0,
                    0, 0, 0, 1, 1, 1, 0,
                    0, 0, 0, 1, 0, 1, 1), byrow = TRUE, nrow = 3)
colnames(actions) <- c("run_away", "scream", "look_for_woodcutter", "kiss_on_cheek", "approach", "offer_food", "flirt_with")
rownames(actions) <- rownames(qualities)
actions
##            run_away scream look_for_woodcutter kiss_on_cheek approach offer_food flirt_with
## wolf              1      1                   1             0        0          0          0
## grannie           0      0                   0             1        1          1          0
## woodcutter        0      0                   0             1        0          1          1

data <- cbind(qualities, actions)

# train the neural network (NN)
set.seed(123) # for reproducibility
neuralnetwork <- neuralnet(run_away + scream + look_for_woodcutter + kiss_on_cheek + approach + offer_food + flirt_with ~ big_ears + big_eyes + big_teeth + kindly + wrinkled + handsome,
                           data = data, hidden = 3,
                           exclude = c(1, 8, 15, 22, 26, 30, 34, 38, 42, 46),
                           lifesign = "minimal", linear.output = FALSE)
## hidden: 3    thresh: 0.01    rep: 1/1    steps: 48   error: 0.01319   time: 0.01 secs

# plot the NN
par_bkp <- par(mar = c(0, 0, 0, 0)) # set different margin to minimize cutoff text
plotnet(neuralnetwork, bias = FALSE)
```

```
par(par_bkp)

# predict actions
round(compute(neuralnetwork, qualities)$net.result)
##            [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## wolf          1    1    1    0    0    0    0
## grannie       0    0    0    1    1    1    0
## woodcutter    0    0    0    1    0    1    1
```

First the qualities and the actions are coded as binary variables. After that the neural network is trained with the qualities as input and the resulting behaviour as output (using the standard formula syntax). In the `neuralnet` function a few additional technical arguments are set whose details won't concern us here; they just simplify the process in this context. Then we plot the learned net and test it by providing it with the respective qualities: in all three cases it predicts the right actions. How did it learn those?

Let us look at the plot of the net. We see that there are two basic building blocks: neurons and weighted connections between them. We have one neuron for each quality and one neuron for each action. Between both layers we have a so-called hidden layer, with three neurons in this case. The learned strength between the neurons is shown by the thickness of the lines (whereby 'black' means positive and 'grey' negative weights). Please have a thorough look at those weights.

You might have noticed that although the net didn’t know anything about the three protagonists in our little story it nevertheless correctly built a representation of them: ‘H1’ (for Hidden 1) represents the wolf because its differentiating quality is ‘big teeth’ which leads to ‘run away’, ‘scream’ and ‘look for woodcutter’, by the same logic ‘H2’ is the woodcutter and ‘H3’ is the grandmother. So the net literally learned to connect the qualities with respective actions of little red riding hood by creating a representation of the three protagonists!

So an artificial neural network is obviously a network of neurons… so let us have a look at those neurons! Basically they are mathematical abstractions of real neurons in your brain. They consist of inputs and an output. The biologically inspired idea is that when the activation of the inputs surpasses a certain threshold, the neuron fires. To be able to learn, the neuron must, before summing up the inputs, weight them so that the output is not just arbitrary but matches some sensible result. What is 'sensible', you might ask? In a biological environment the answer is not always so clear-cut, but in our simple example here the neuron just has to match the output we provide it with (= supervised learning).

The following abstraction has all we need, inputs, weights, the sum function, a threshold after that and finally the output of the neuron:

Let us talk a little bit about what is going on here intuitively. First every input is taken, multiplied by its weight, and all of this is summed up. Some of you might recognize this mathematical operation as a scalar product (also called dot product). Another mathematical definition of a scalar product is the following: **a** ⋅ **b** = |**a**| |**b**| cos(θ).

That is, we multiply the lengths of two vectors by the cosine of the angle between them. What has the cosine to do with it? The cosine of an angle becomes one when both vectors point in the same direction, it becomes zero when they are orthogonal, and minus one when both point in opposite directions. Does this make sense? Well, let me give you a little (albeit crude) parable. When growing up there are basically three stages: first you are totally dependent on your parents, then comes puberty and you are against whatever they say or think, and after some years you are truly independent (some never reach that stage…). What does independent mean here? It means that you agree with some of the things your parents say and think and you disagree with some other things. During puberty you are as dependent on your parents as during being a toddler – you just don't realize it, because in reality you, so to speak, only multiply everything your parents say or think by minus one!

What is the connection with cosine? Well, as a toddler both you and your parents tend to be aligned, which gives one. During puberty both of you are aligned but in opposing directions, which gives minus one. Only as a grown-up are you both independent, which mathematically means that your vector in a way points in both directions at the same time, which is only possible when it is orthogonal to the vector of your parents (you have entered a new dimension, literally) – and that gives zero for the cosine.

So cosine is nothing but a measure of dependence – as is correlation by the way. So this setup ensures that the neuron learns the dependence (or correlation) structure between the inputs and the output! The step function is just a way to help it to decide on which side of the fence it wants to sit, to make the decision clearer whether to fire or not. To sum it up, an artificial neuron is a non-linear function (in this case a step function) on a scalar product of the inputs (fixed) and the weights (adaptable to be able to learn). By adapting the weights the neuron learns the dependence structure between inputs and output.
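In base R the dot product and this cosine measure are one-liners; the three calls below show the aligned, opposite and orthogonal cases from the parable:

```r
# cosine of the angle between two vectors, via the dot product
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(c(1, 2), c(2, 4))    # same direction: 1
cosine(c(1, 2), c(-1, -2))  # opposite direction: -1
cosine(c(1, 2), c(-2, 1))   # orthogonal: 0
```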

In R you code this idea of an artificial neuron as follows:

```
neuron <- function(input) ifelse(weights %*% input > 0, 1, 0)
```

Now let us use this idea in R by training an artificial neuron to classify points in a plane. Have a look at the following table:

Input 1 | Input 2 | Output
---|---|---
1 | 0 | 0
0 | 0 | 1
1 | 1 | 0
0 | 1 | 1

If you plot those points with the colour-coded pattern you get the following picture:

The task for the neuron is to find a separating line and thereby classify the two groups. Have a look at the following code:

```
# inspired by Kubat: An Introduction to Machine Learning, p. 72
plot_line <- function(w, col = "blue", add = TRUE)
  curve(-w[1] / w[2] * x - w[3] / w[2], xlim = c(-0.5, 1.5), ylim = c(-0.5, 1.5),
        col = col, lwd = 3, xlab = "Input 1", ylab = "Input 2", add = add)

# step function on scalar product of weights and input
neuron <- function(input) as.vector(ifelse(input %*% weights > 0, 1, 0))

eta <- 0.5 # learning rate

# examples
input <- matrix(c(1, 0,
                  0, 0,
                  1, 1,
                  0, 1), ncol = 2, byrow = TRUE)
input <- cbind(input, 1) # bias for intercept of line
output <- c(0, 1, 0, 1)

weights <- c(0.25, 0.2, 0.35) # random initial weights
plot_line(weights, add = FALSE); grid()
points(input[ , 1:2], pch = 16, col = (output + 2))

# training of weights of neuron
for (example in 1:length(output)) {
  weights <- weights + eta * (output[example] - neuron(input[example, ])) * input[example, ]
  plot_line(weights)
}
plot_line(weights, col = "black")
```

```
# test: applying neuron on input
apply(input, 1, neuron)
## [1] 0 1 0 1
```

As you can see, the result matches the desired output. Graphically, the black line is the end result: it separates the green from the red points, so the neuron has learned this simple classification task. The blue lines show where the neuron starts from and where it passes through during training – they are not yet able to classify the points correctly.

The training, i.e. adapting the weights, takes place in this line:

```
weights <- weights + eta * (output[example] - neuron(input[example, ])) * input[example, ]
```

The idea is to compare the current output of the neuron with the desired output, scale the difference by some learning factor (`eta`) and modify the weights accordingly: if the output is too big, the weights are made smaller, and vice versa. Do this for all examples (sometimes you need another loop to train the neuron on the examples several times) and that’s it. That is the core idea behind the ongoing revolution of neural networks!
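To make the update rule concrete, here is the first training step worked out by hand (the numbers match the initial weights and the first input row from the code above):

```
eta     <- 0.5
weights <- c(0.25, 0.2, 0.35)
x       <- c(1, 0, 1)   # Input 1 = 1, Input 2 = 0, plus the bias input 1
target  <- 0

out <- as.numeric(sum(weights * x) > 0)   # 0.25 + 0.35 = 0.6 > 0, so the neuron fires: 1
weights + eta * (target - out) * x        # c(-0.25, 0.20, -0.15): weights shrink
```

Because the output (1) was bigger than the target (0), the weights attached to the active inputs are reduced – exactly the "too big, make smaller" logic described above.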

Ok, so far we have had a closer look at one part of neural networks, namely the neurons; let us now turn to the network structure (also called network topology). First, why do we need a whole network at all when single neurons are already able to solve classification tasks? The answer is that they can do so only for very simple problems. For example, the neuron above can only distinguish between linearly separable points, i.e. it can only draw lines. It fails on the simple problem of four points that are coloured green, red, red, green from top left to bottom right (a version of the XOR problem – try it yourself). We would need a non-linear function to separate the points, so we have to combine several neurons to solve more complicated problems.

The biggest problem you have to overcome when you combine several neurons is how to adapt all the weights. You need a system for attributing the error at the output layer to all the weights in the net. This had been a profound obstacle until an algorithm called backpropagation (also abbreviated backprop) was invented (or found). We won’t get into the details here, but the general idea is to work backwards from the output layer through all of the hidden layers until one reaches the input layer, modifying the weights according to their respective contribution to the resulting error. This is done several times (sometimes millions of times) for all training examples until one achieves an acceptable error rate on the training data.
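To give a flavour of what backpropagation looks like in code, here is a minimal sketch (not from the original post) of a tiny sigmoid network learning the XOR pattern in base R. The layer size, learning rate and epoch count are ad-hoc choices for illustration only:

```
# a 2-3-1 sigmoid network trained on XOR with stochastic gradient descent
set.seed(1)
sigmoid <- function(z) 1 / (1 + exp(-z))

X <- matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE)
y <- c(0, 1, 1, 0)                        # XOR targets

W1  <- matrix(runif(9, -1, 1), nrow = 3)  # 2 inputs + bias -> 3 hidden units
W2  <- runif(4, -1, 1)                    # 3 hidden + bias -> 1 output
eta <- 0.5

for (epoch in 1:20000) {
  for (n in sample(4)) {
    x  <- c(X[n, ], 1)                    # input plus bias
    h  <- sigmoid(as.vector(t(W1) %*% x)) # forward pass: hidden layer
    ha <- c(h, 1)                         # hidden activations plus bias
    o  <- sigmoid(sum(W2 * ha))           # forward pass: output
    # backward pass: output error first, then its share for each hidden unit
    d_o <- (o - y[n]) * o * (1 - o)
    d_h <- (W2[1:3] * d_o) * h * (1 - h)
    W2  <- W2 - eta * d_o * ha
    W1  <- W1 - eta * outer(x, d_h)
  }
}

preds <- apply(X, 1, function(r) {
  h <- sigmoid(as.vector(t(W1) %*% c(r, 1)))
  sigmoid(sum(W2 * c(h, 1)))
})
round(preds)  # with most random seeds this recovers the 0, 1, 1, 0 pattern
```

Note how the error is propagated backwards through `d_o` and `d_h` – that is the whole trick, just applied layer by layer in real networks.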

The result is that you get several layers of abstraction. So when you, for example, want to train a neural network to recognize certain faces, you start with the raw input data in the form of pixels; these are automatically combined into abstract geometrical structures; after that the net detects certain elements of faces, like eyes and noses; and finally abstractions of certain faces are rebuilt by the net. See the following picture (from nivdul.wordpress.com) for an illustration:

So far we have only coded very small examples of neural networks. Real-world examples often have dozens of layers with thousands of neurons so that much more complicated patterns can be learned. The more layers there are, the ‘deeper’ a net becomes – which is why the current revolution in this field is called “deep learning”: so many hidden layers are involved. Let us now look at a more realistic example: predicting whether a breast cell is malignant or benign.

Have a look at the following code:

```
library(OneR)
library(neuralnet) # needed for neuralnet() below

data(breastcancer)
data <- breastcancer
colnames(data) <- make.names(colnames(data))
data$Class <- as.integer(as.numeric(data$Class) - 1) # for compatibility with neuralnet
data <- na.omit(data)

# Divide training (80%) and test set (20%)
set.seed(12) # for reproducibility
random <- sample(1:nrow(data), 0.8 * nrow(data))
data_train <- data[random, ]
data_test <- data[-random, ]

# Train NN on training set
f <- reformulate(setdiff(colnames(data), "Class"), response = "Class") # for compatibility with neuralnet (instead of Class ~ .)
model_train <- neuralnet(f, data = data_train, hidden = c(9, 9), lifesign = "minimal")
## hidden: 9, 9    thresh: 0.01    rep: 1/1    steps: 3784   error: 0.00524  time: 3.13 secs

# Plot net
plot(model_train, rep = "best")
```

```
# Use trained model to predict test set
prediction <- round(compute(model_train, data_test[ , -10])$net.result)
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction   0  1 Sum
##        0    93  2  95
##        1     4 38  42
##        Sum  97 40 137
## 
## Confusion matrix (relative):
##           Actual
## Prediction    0    1  Sum
##        0   0.68 0.01 0.69
##        1   0.03 0.28 0.31
##        Sum 0.71 0.29 1.00
## 
## Accuracy:
## 0.9562 (131/137)
## 
## Error rate:
## 0.0438 (6/137)
## 
## Error rate reduction (vs. base rate):
## 0.85 (p-value = 0.0000000000001297637)
```

So you see that a relatively simple net achieves an accuracy of about 95% out of sample. The code itself should be mostly self-explanatory. For the actual training, the `neuralnet` function from the package of the same name is used. The input method is the R formula interface, with the twist that the normally used shortcut `~ .` (for all variables except the given dependent one) isn’t supported, so a workaround has to be used (which makes part of the code a little clumsy). Also a little bit unconventional is the name of the predict function: it is called `compute`. And you have to make sure that the input only consists of the variables you used for building the model, otherwise you will get an error.

When you look at the net one thing might strike you as odd: there are three neurons at the top with a fixed value of 1. These are so called bias neurons and they serve a similar purpose as the intercept in a linear regression: they kind of shift the model as a whole in n-dimensional feature space just as a regression line is being shifted by the intercept. In case you were attentive we also smuggled in a bias neuron in the above example of a single neuron: it is the last column of the input matrix which contains only ones.
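The analogy with the regression intercept can be made concrete with a toy example (made-up data, not from the post): appending a constant-1 column by hand, as the neuron example does, recovers exactly the intercept that `lm()` fits automatically.

```
x <- c(1, 2, 3, 4)
y <- 2 * x + 1      # toy data: y = 2x + 1 exactly

coef(lm(y ~ x))     # intercept 1, slope 2 -- lm() adds the constant column itself

X <- cbind(x, 1)    # explicit constant-1 "bias" column, as in the neuron example
qr.solve(X, y)      # solving directly gives the same coefficients: 2 and 1
```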

Another thing: as can be seen even in this simple example, it is very hard to find out what a neural network has actually learned – the following well-known anecdote (urban legend?) shall serve as a warning: some time ago the military built a system whose aim was to distinguish military vehicles from civilian ones. They chose a neural network approach and trained the system with pictures of tanks, humvees and missile launchers on the one hand and normal cars, pickups and lorries on the other. After having reached a satisfactory accuracy they brought the system into the field (quite literally). It failed completely, performing no better than a coin toss. What had happened? No one knew, so they re-engineered the black box (no small feat in itself) and found that most of the military pics were taken at dusk or dawn and most civilian pics under brighter weather conditions. The neural net had only learned the difference between light and dark!

Just for comparison the same example with the OneR package:

```
data(breastcancer)
data <- breastcancer

# Divide training (80%) and test set (20%)
set.seed(12) # for reproducibility
random <- sample(1:nrow(data), 0.8 * nrow(data))
data_train <- optbin(data[random, ], method = "infogain")
## Warning in optbin.data.frame(data[random, ], method = "infogain"): 12
## instance(s) removed due to missing values
data_test <- data[-random, ]

# Train OneR model on training set
model_train <- OneR(data_train, verbose = TRUE)
## 
##     Attribute                   Accuracy
## 1 * Uniformity of Cell Size     92.32%
## 2   Uniformity of Cell Shape    91.59%
## 3   Bare Nuclei                 90.68%
## 4   Bland Chromatin             90.31%
## 5   Normal Nucleoli             90.13%
## 6   Single Epithelial Cell Size 89.4%
## 7   Marginal Adhesion           85.92%
## 8   Clump Thickness             84.28%
## 9   Mitoses                     78.24%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

# Show model and diagnostics
summary(model_train)
## 
## Call:
## OneR.data.frame(x = data_train, verbose = TRUE)
## 
## Rules:
## If Uniformity of Cell Size = (0.991,2] then Class = benign
## If Uniformity of Cell Size = (2,10]    then Class = malignant
## 
## Accuracy:
## 505 of 547 instances classified correctly (92.32%)
## 
## Contingency table:
##            Uniformity of Cell Size
## Class       (0.991,2] (2,10] Sum
##   benign        * 318     30 348
##   malignant        12  * 187 199
##   Sum             330    217 547
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 381.78243, df = 1, p-value < 0.00000000000000022204

# Plot model diagnostics
plot(model_train)
```

```
# Use trained model to predict test set
prediction <- predict(model_train, data_test)

# Evaluate model performance on test set
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##            Actual
## Prediction  benign malignant Sum
##   benign        92         0  92
##   malignant      8        40  48
##   Sum          100        40 140
## 
## Confusion matrix (relative):
##            Actual
## Prediction  benign malignant  Sum
##   benign      0.66      0.00 0.66
##   malignant   0.06      0.29 0.34
##   Sum         0.71      0.29 1.00
## 
## Accuracy:
## 0.9429 (132/140)
## 
## Error rate:
## 0.0571 (8/140)
## 
## Error rate reduction (vs. base rate):
## 0.8 (p-value = 0.000000000007992571)
```

As you can see the accuracy is only slightly worse but you have full interpretability of the model… and you would only need to measure one value (“Uniformity of Cell Size”) instead of 9 to get a prediction!

On the other hand making neural networks interpretable is one of the big research challenges at the moment.

To end this rather long post: there is a real revolution going on at the moment with all kinds of powerful neural networks. Especially promising is the combination of reinforcement learning (the topic of an upcoming post) and neural networks, where the reinforcement learning algorithm uses a neural network as its memory. For example, the revolutionary AlphaGo Zero is built this way: it just received the rules of Go, one of the most demanding strategy games humanity has ever invented, and reached superhuman strength after just three days! The highest human rank in Go has an Elo rating of 2940 – AlphaGo Zero achieves 5185! Even the best players don’t stand a chance against this monster of a machine. The neural network technology used for AlphaGo Zero and many other deep neural networks is called TensorFlow, which can also easily be integrated into the R environment. To find out more go here: https://tensorflow.rstudio.com/

In this whole area there are many mind-blowing projects underway, so stay tuned!

To **leave a comment** for the author, please follow the link and comment on their blog: ** R-Bloggers – Learning Machines**.


(This article was first published on ** Digital Age Economist on Digital Age Economist**, and kindly contributed to R-bloggers)

This is the second installment in a three-part series on integrating H2O, AWS and p(f)urrr. In Part II, I will showcase how we can combine `purrr` and `h2o` to train and stack ML models. In the first post we looked at starting up an AMI on AWS, which acts as the infrastructure upon which we will model.

Part one of the post can be found here

The packages we will use in this post:

```
library(tidyverse)
library(h2oEnsemble)
library(caret)
library(mlbench)
```

In this post I will be using the `r3.4xlarge` machine we started up in Part I of this series. The code below starts up an H2O instance that uses all available cores and limits the instance’s memory to `100GB`.

```
h2o.init(nthreads = -1, max_mem_size = '100g')
```

We can see that once the server is up, we have `88.89 GB` of RAM and 16 cores available. Also take note of the very handy logging file: `/tmp/RtmpKLRj1u/h2o_rstudio_started_from_r.out`.

We will use the `Higgs` data to experiment on:

Using simulated data with features characterizing events detected by ATLAS, your task is to classify events into “tau decay of a Higgs boson” versus “background.”

As with all machine learning applications, we need to split our data. We will use the `h2o.splitFrame` function to split the data into training, validation and test sets. What is very important to notice in the function is the `destination_frame` parameter, as the datasets are being converted to `.hex` format, which is the required format for the H2O server.

```
train <- read_csv("https://s3.amazonaws.com/erin-data/higgs/higgs_train_5k.csv")
test <- read_csv("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
higgs <- rbind(train, test) %>%
mutate(response = as.factor(response))
set.seed(107)
splits <- h2o.splitFrame(
data = as.h2o(higgs, destination_frame = "higgs.hex"),
ratios = c(0.6, 0.2), ## only need to specify 2 fractions, the 3rd is implied
destination_frames = c("train.hex", "valid.hex", "test.hex"), seed = 1234
)
training <- splits[[1]]
valid <- splits[[2]]
testing <- splits[[3]]
```

The split frames are now of the correct class, `H2OFrame`. Next we need to specify the outcome and predictor columns:

```
Y <- 'response'
X <- setdiff(names(training), Y)
```

To test the basic functionality of `H2O`, let’s train a `gbm` to predict `signal` against `background` noise for the Higgs dataset.

```
higgs.gbm <- h2o.gbm(y = Y, x = X, training_frame = training, ntrees = 5,
max_depth = 3, min_rows = 2, learn_rate = 0.2,
distribution = "AUTO")
higgs.gbm
```

This very basic specification of the model, tuning = {`ntrees = 5`, `max_depth = 3`, `learning_rate = 0.2`}, gives us an AUC of ~77%. Let’s see how the model does on the validation set:

```
h2o.performance(higgs.gbm, valid)
```

From the validation set we can see that the models don’t perform too differently: the model correctly predicts `signal` vs `background` with about 35.5% prediction error and an AUC of 73%. For the rest of this post we will not concern ourselves with model performance, but rather focus on how we can build nice pipelines using `H2O` and `purrr`.
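Since the rest of the post leans heavily on `purrr`, here is a one-line refresher with a toy example (unrelated to the Higgs data): `map()` applies a function to each element of a list and returns a list, while the typed variants return vectors.

```
library(purrr)

map(list(a = 1:3, b = 4:6), sum)      # returns a list: $a 6, $b 15
map_dbl(list(a = 1:3, b = 4:6), sum)  # returns a named numeric vector: a 6, b 15
```

The `pmap()` variant used later generalizes this to iterating over several lists in parallel, one element from each per call.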

Having a basic understanding of the `H2O` API, let’s ramp it up to train multiple models using `purrr`. The first step is to create a function that contains all the `H2O` models we want to use in our modeling design. Our wrapper should contain the basic parameters: `training_frame`, `validation_frame`, `nfolds` and the `Y`/`X` variable names.

Very important features to notice in our new wrapper are the `fold_assignment = "Modulo"` and `keep_cross_validation_predictions = TRUE` parameters. We need these because we want to stack the models at a later stage to check whether stacking improves performance^{1}.

```
h2o.models <- function(df, validation, nfolds, model, Y, X){
if(model == "gbm"){
return(h2o.gbm(y = Y, x = X,
training_frame = df,
validation_frame = validation,
distribution = "bernoulli",
nfolds = nfolds,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE))
}
if(model == "rf"){
return(h2o.randomForest(y = Y, x = X,
training_frame = df,
validation_frame = validation,
distribution = "bernoulli",
nfolds = nfolds,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE))
}
if(model == "xgboost"){
return(h2o.xgboost(y = Y, x = X,
training_frame = df,
validation_frame = validation,
distribution = "bernoulli",
nfolds = nfolds,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE))
}
if(model == "deeplearning"){
return(h2o.deeplearning(y = Y, x = X,
training_frame = df,
validation_frame = validation,
distribution = "bernoulli",
nfolds = nfolds,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE))
}
}
```

This gives us a nice wrapper for use with `purrr`’s `map` functions. Let’s create a tibble with the necessary layout so we can apply our new function across the dataset. You do not strictly need to add the training, validation and testing frames to the tibble, but it helps in the long run, especially if they start differing somehow^{2}. It also helps for auditing the models:

```
trained_ML <- tibble(Models = c("gbm", "rf", "deeplearning", "xgboost")) %>%
mutate(
training = rep(list(training), nrow(.)),
valid = rep(list(valid), nrow(.))
)
trained_ML
```

Here we see that we have a special column, `Models`, specifically naming the algorithm assigned to each training and test set.

Let’s now use the `h2o.models` function by training all the different models and evaluating their performance.

```
trained_ML <- tibble(Models = c("gbm", "rf", "deeplearning", "xgboost")) %>%
mutate(
training = rep(list(training), nrow(.)),
valid = rep(list(valid), nrow(.))
) %>%
# pmap to apply the h2o functional across each row
mutate(trained_models = pmap(list(training, valid, Models), ~h2o.models(..1, ..2, ..3, nfolds = 5, Y, X))) %>%
# once trained, predict against the test set with respective model gbm etc
mutate(predictions = map(trained_models, h2o.predict, testing)) %>%
# create new column containing in-sample confusionMatrix
mutate(confusion = map(trained_models, h2o.confusionMatrix)) %>%
# create new column containing test set performance metrics
mutate(performance = map2(trained_models, valid, ~h2o.performance(.x, .y)))
trained_ML
```

As a last measure, we might not want to have the performance metrics nested in the tibble frame. So to lift the measures out of the performance list, we can build our own function to extract the performance metrics we might need in evaluating each of the model’s performance:

```
f_perf <- function(performance){
tibble(errors = tail(performance@metrics$cm$table$Error, 1),
logloss = performance@metrics$logloss,
rmse = performance@metrics$RMSE,
r_squared = performance@metrics$r2,
pr_auc = performance@metrics$pr_auc,
gini = performance@metrics$Gini)
}
trained_ML <- trained_ML %>%
bind_cols(map_df(.$performance, f_perf))
```

This new tibble gives us a very good idea of which model performs best if we base it on the `logloss` measure of fit: `gbm`. From here we could build a wrapper around different `gbm` models and use the same framework to train multiple differently specified `gbm` models if we wanted to do so.
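As a hypothetical sketch of that extension (the parameter values are ad hoc, and it assumes the `training`/`valid` frames and the `Y`/`X` names from above are still in scope on a running H2O cluster), such a grid of `gbm` models could be pushed through the same `pmap` pattern:

```
# a small, ad-hoc grid of gbm tunings run through the same pmap pattern
gbm_grid <- tibble(ntrees = c(50, 100, 200), max_depth = c(3, 5, 7)) %>%
  mutate(model = pmap(list(ntrees, max_depth),
                      ~ h2o.gbm(y = Y, x = X,
                                training_frame = training,
                                validation_frame = valid,
                                ntrees = ..1, max_depth = ..2)),
         # pull the validation logloss out of each fitted model for comparison
         logloss = map_dbl(model, ~ h2o.logloss(h2o.performance(.x, valid))))
gbm_grid
```

The same auditing benefits apply: every tuning, its fitted model and its validation metric live side by side in one tibble.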

The last bit of performance improvement we attempt in any model process is stacking. Model stacking is the idea of taking the predictions of the base learners (in our case the `rf`, `gbm`, etc.) and then using a super learner (or meta-learner) that takes the base learners’ predictions/CV results as input and runs a model across those results to produce the final prediction. There are some interesting papers on this idea, one by Leo Breiman^{3} himself, and another by van der Laan, Polley, and Hubbard (2007) that statistically proves why these *Super Learners* work.

As my metalearner in this example I will use a `glm`:

```
models <- trained_ML %>%
pull(trained_models)
metalearner <- c("h2o.glm.wrapper")
stack <- h2o.stack(models = models,
response_frame = training[,Y],
metalearner = metalearner,
seed = 1,
keep_levelone_data = TRUE)
# Compute test set performance:
perf <- h2o.ensemble_performance(stack, newdata = testing)$ensemble
stack_df <- tibble(Models = "stack",
trained_models = list(stack),
performance = list(perf)) %>%
bind_cols(map_df(.$performance, f_perf))
bind_rows(trained_ML,stack_df)
```

Our stack of the models does not seem to improve the `logloss` performance measure in this specific case. That being said, we haven’t spent a lot of time trying to tune any of the models.

The last bit of code we need to finish this pipeline is to shutdown the H2O server:

```
h2o.ls()
h2o.removeAll()
h2o.shutdown()
```

As you can see, combining the many-models idea with the engineering of H2O and AWS gives us a very nice framework with which we can run a crazy amount of experiments at low cost and within a short time period. In writing this blog, I spent a total of 2 hours and 40 minutes running the models multiple times, getting the stack to work and overall just reading some H2O documentation. While all this was happening, our `r3.4xlarge` – 16-core, 100GB-RAM – machine was running, resulting in a total cost for this blog of around $0.42, give or take. Which I think isn’t too bad for the resources we used.

In the last and final part of the series, we will explore how to initialize an H2O AWS server using `furrr` and a basic script. This means you can essentially write the code on your home setup, use `boto3` from `python` to spin up an `EC2` instance, conduct the analysis, and return only the most important results to your local environment.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Digital Age Economist on Digital Age Economist**.
