(This article was first published on **Shirin's playgRound**, and kindly contributed to R-bloggers)

I am very happy to announce that (after many months) my interactive course on Hyperparameter Tuning in R has now been officially launched on Data Camp!

For many machine learning problems, simply running a model out-of-the-box and getting a prediction is not enough; you want the best model with the most accurate prediction. One way to perfect your model is with hyperparameter tuning, which means optimizing the settings for that specific model. In this course, you will work with the caret, mlr and h2o packages to find the optimal combination of hyperparameters in an efficient manner using grid search, random search, adaptive resampling and automatic machine learning (AutoML). Furthermore, you will work with different datasets and tune different supervised learning models, such as random forests, gradient boosting machines, support vector machines, and even neural nets. Get ready to tune!
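As a flavor of what hyperparameter tuning looks like in practice, here is a minimal caret grid-search sketch (not taken from the course; the dataset and the grid values are arbitrary placeholders):

```r
library(caret)

# Grid search over the random forest's mtry hyperparameter,
# evaluated with 5-fold cross-validation on the built-in iris data.
set.seed(42)
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(mtry = c(1, 2, 3))  # candidate hyperparameter values
fit <- train(Species ~ ., data = iris,
             method = "rf",
             trControl = ctrl,
             tuneGrid = grid)
fit$bestTune  # the mtry value with the best cross-validated accuracy
```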

You can take the course here:

To **leave a comment** for the author, please follow the link and comment on their blog: **Shirin's playgRound**.


(This article was first published on **Revolutions**, and kindly contributed to R-bloggers)

*A monthly roundup of news about Artificial Intelligence, Machine Learning and Data Science. This is an eclectic collection of interesting blog posts, software announcements and data applications from Microsoft and elsewhere that I've noted over the past month or so.*

A preview of TensorFlow 2.0 (the public preview is expected "early this year").

pcLasso, an R package implementing a new method for supervised learning described by co-author Rob Tibshirani as "principal components regression meets the lasso".

R 3.5.2 has been released.

A retrospective of Google Research activities in AI in 2018.

Google Cloud Platform now supports R jobs (via SparkR) in Cloud DataProc.

GCP App Engine support for Python 3.7 now generally available.

Amazon SageMaker now comes preconfigured with support for scikit-learn.

Microsoft Professional Program for Data Analysis, a new online course and certification. Other MPP tracks include Data Science and Artificial Intelligence.

A single API key can now be used to access Cognitive Services APIs for language, vision and search.

AzureR, a suite of packages for interfacing with storage, virtual machines, containers and other Azure services from the R language, is now available on CRAN.

E-book by TWIML AI host Sam Charrington: Kubernetes for Machine Learning, Deep Learning and AI (requires free sign-up).

E-book by Patrick Hall and Navdeep Gill: Introduction to Machine Learning Interpretability (requires free registration).

An in-depth introduction to convolutional neural networks, from Ars Technica: How computers got shockingly good at recognizing images.

An on-line course from Databricks, the Deep Learning Fundamentals Series, with a focus on Keras and TensorFlow.

Getting Started with TensorFlow Probability, from R, a blog post from RStudio.

What can Neural Networks Learn?, an approachable look at the inner workings of neural network classifiers from Brandon Rohrer.

A blog post describing the computational graph concepts behind TensorFlow, including an illustrative implementation of core TensorFlow operations in numpy.

Using computer vision to monitor shelf stocking policies, a three-part series on complex image classification: Part 1, Part 2, Part 3.

Building a pet breed identification application via transfer learning with Azure ML Service and Python. (Github repo here.)

*Find previous editions of the monthly AI roundup here.*

To **leave a comment** for the author, please follow the link and comment on their blog: **Revolutions**.


(This article was first published on **Emma R**, and kindly contributed to R-bloggers)

I used DataCamp's excellent Introduction to R as Essential Prior Independent Study and found it made people a bit less worried about a term of R!

I have a lot of fun teaching first year biology undergraduates but there are a few challenges in teaching data skills when they are not (perceived as) a student’s core discipline but instead required to carry out research within it. At this early stage in their higher education, Biologists can be surprised by the amount of their degree devoted to data analysis, reporting and presentation.

In my introductory lecture I use polling software to get responses from my students to:

As you can see, my students don’t mind making their feelings clear!

Those are the results from the last two years – if anything, this year’s students are more sure they won’t enjoy it! I suspect this is not the result my colleagues teaching Genetics, Evolution, Cell Biology or Development would get if they asked the same. And understandably, I think.

This year I set Essential Prior Independent Study using the ability to set "Assignments" for a team in my DataCamp classroom. I had them do only the first two chapters (Intro to basics and Vectors) of Introduction to R. Last year I suggested DataCamp as an optional activity and used part of it in an introductory workshop.

I was delighted to see that well over half the class of 256 students had started or completed the assignments before the lecture, despite the assignment deadline still being a day away. And there was more: when I asked how they felt about R, they were more positive than last year:

How good is that? Many ‘Seems Ok’ and ‘Undecided’ and more students excited than terrified is a win!

Well done them!

To **leave a comment** for the author, please follow the link and comment on their blog: **Emma R**.


(This article was first published on **R Programming – DataScience+**, and kindly contributed to R-bloggers)


In this article, you will learn how to make an Automated Dashboard for Credit Modelling with Decision trees and Random forests in R. First you need to install the `rmarkdown` package into your R library. Assuming that you have installed `rmarkdown`, you next create a new `rmarkdown` script in R.
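For instance, the required packages can be installed in one step (a setup sketch; the list mirrors the libraries loaded in the dashboard code in this post):

```r
# Install the packages used in this post (only needed once)
install.packages(c("rmarkdown", "flexdashboard", "dplyr", "caret",
                   "partykit", "randomForest", "Hmisc"))
```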

After this, you type the following code in order to create a dashboard with `rmarkdown` and `flexdashboard`:

````
---
title: "Automated Dashboard for Credit Modelling with Decision trees and Random forests in R"
author: "Kristian Larsen"
output:
  flexdashboard::flex_dashboard:
    orientation: rows
    vertical_layout: scroll
---

```{r setup, include=FALSE}
# Data management packages
library(flexdashboard)
library(dplyr)
library(caret)
library(partykit)
library(randomForest)
library(Hmisc)
knitr::opts_chunk$set(cache=TRUE)
options(scipen = 9999)
rm(list=ls())

# Read dataset
loans <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml7/credit.csv")
str(loans)

# Data management
# Change the order/level of checking_balance variable
loans$checking_balance <- factor(loans$checking_balance, levels = c(" 200 DM", "unknown"))
summary(loans[loans$default == "yes", "checking_balance"])

# Change the order/level of saving_balance variable
loans$savings_balance <- factor(loans$savings_balance, levels = c(" 1000 DM", "unknown"))
summary(loans[loans$default == "yes", "savings_balance"])

# Change the order/level of credit_history variable
loans$credit_history <- factor(loans$credit_history, levels = c("critical", "poor", "good", "very good", "perfect"))
summary(loans[loans$default == "yes", "credit_history"])

# Change the order/level of other_credit variable
loans$other_credit <- factor(loans$other_credit, levels = c("none", "store", "bank"))
summary(loans[loans$default == "yes", "other_credit"])

# Train/test split
set.seed(300)
in_loans_train <- sample(nrow(loans), nrow(loans)*0.75)
loans_train <- loans[in_loans_train, ]
loans_test <- loans[-in_loans_train, ]
```

Row {data-width=350}
-----------------------------------------------------------------------

### Chart A - Decision tree Model I

```{r}
loans_model_dt <- ctree(default ~ ., loans_train)
plot(loans_model_dt)
```

### Chart B - Decision tree Model I - simple

```{r}
plot(loans_model_dt, type = "simple")
```

Row {data-width=650}
-----------------------------------------------------------------------

### Chart C - Decision tree Model I - formula

```{r}
loans_model_dt
```

Row {data-width=650}
-----------------------------------------------------------------------

### Chart D - Confusion Matrix for Decision tree Model I

```{r}
loans_pred_dt <- predict(loans_model_dt, loans_test)
dt_conft <- table("prediction" = loans_pred_dt, "actual" = loans_test$default)
accu_dt <- round((dt_conft[1]+dt_conft[4])/sum(dt_conft[1:4]), 4)
prec_dt <- round(dt_conft[4]/(dt_conft[2]+dt_conft[4]), 4)
reca_dt <- round(dt_conft[4]/(dt_conft[4]+dt_conft[3]), 4)
spec_dt <- round(dt_conft[1]/(dt_conft[1]+dt_conft[2]), 4)
confusionMatrix(loans_pred_dt, loans_test$default, positive = "yes")
```

Row {data-width=650}
-----------------------------------------------------------------------

### Chart E - Decision tree Model II

```{r}
loans_model_dt2 <- ctree(default ~ ., loans_train, control = ctree_control(mincriterion = 0.7))
plot(loans_model_dt2)
```

Row {data-width=650}
-----------------------------------------------------------------------

### Chart F - Decision tree Model II - formula

```{r}
loans_model_dt2
```

Row {data-width=650}
-----------------------------------------------------------------------

### Chart G - Confusion Matrix for Decision tree Model II

```{r}
loans_pred_dt2 <- predict(loans_model_dt2, loans_test)
confusionMatrix(loans_pred_dt2, loans_test$default, positive = "yes")
```

Row {data-width=650}
-----------------------------------------------------------------------

### Chart H - Random Forest Model

```{r}
set.seed(300)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, allowParallel = TRUE)
loans_rf <- train(default ~ ., data = loans, method = "rf", trControl = ctrl)
loans_rf$finalModel
```

Row {data-width=650}
-----------------------------------------------------------------------

### Chart I - Random Forest Model - variable importance

```{r}
varImp(loans_rf)
```

Row {data-width=350}
-----------------------------------------------------------------------

### Chart J - Random Forest Model - Final model plot I

```{r}
plot(loans_rf$finalModel)
legend("topright", colnames(loans_rf$finalModel$err.rate), col = 1:6, cex = 0.8, fill = 1:6)
```

### Chart K - Random Forest Model - Final model plot II

```{r}
plot(loans_rf)
```
````

The results of the above code are published with RPubs here.

Related Post

- Automated Dashboard for Classification Neural Network in R
- How to Achieve Parallel Processing in Python Programming
- Recommender System for Christmas in Python
- Automated Dashboard visualizations with Time series visualizations in R
- Automated Dashboard visualizations with distribution in R

To **leave a comment** for the author, please follow the link and comment on their blog: **R Programming – DataScience+**.


(This article was first published on **Shirin's playgRound**, and kindly contributed to R-bloggers)

These are slides from a lecture I gave at the School of Applied Sciences in Münster. In this lecture, I talked about **Real-World Data Science** and showed examples on **Fraud Detection, Customer Churn & Predictive Maintenance**.

The slides were created with xaringan.

To **leave a comment** for the author, please follow the link and comment on their blog: **Shirin's playgRound**.


(This article was first published on **Revolutions**, and kindly contributed to R-bloggers)

The future package is a powerful and elegant cross-platform framework for orchestrating asynchronous computations in R. It's ideal for working with computations that take a long time to complete; that would benefit from using distributed, parallel frameworks to make them complete faster; and that you'd rather not have locking up your interactive R session. You can get a good sense of the future package from its introductory vignette or from this eRum 2018 presentation by author Henrik Bengtsson (video embedded below), but at its simplest it allows constructs in R like this:

```
a %<-% slow_calculation(1:50)
b %<-% slow_calculation(51:100)
a + b
```

The idea here is that `slow_calculation` is an R function that takes a lot of time, but with the special `%<-%` assignment operator the computation begins *and the R interpreter is ready immediately*. The first two lines of R code above take essentially zero time to execute. The future package farms off those computations to another process or even a remote system (you specify which with a preceding `plan` call), and R will only halt when the *result* is needed, as in the third line above. This is beneficial in Bengtsson's own work, where he uses the future package to parallelize cancer research on DNA sequences on high-performance computing (HPC) clusters.
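Putting the pieces together, a self-contained sketch of the pattern looks like this (assuming only that the future package is installed; `plan(multisession)` evaluates the futures in background R sessions):

```r
library(future)
plan(multisession)  # run futures in parallel background R sessions

# A stand-in for a genuinely slow computation
slow_calculation <- function(x) { Sys.sleep(1); sum(x) }

a %<-% slow_calculation(1:50)    # returns immediately
b %<-% slow_calculation(51:100)  # returns immediately
a + b  # blocks here until both results are ready; 5050
```

With the multisession plan, the two one-second calls run concurrently, so the total wait is about one second rather than two.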

The future package supports a wide variety of computation frameworks, including parallel local R sessions, remote R sessions, and cluster computing frameworks. (If you can't use any of these, it falls back to evaluating the expressions locally, in sequence.) The future package also works in concert with other parallel programming systems already available in R. For example, it provides `future_lapply` as a futurized analog of `lapply`, which will use whatever computation plan you have defined to run the computations in parallel.
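A minimal sketch of that analog (note `future_lapply` now lives in the companion future.apply package):

```r
library(future.apply)
plan(multisession)  # the current plan decides where each element is processed

# Drop-in replacement for lapply; elements may be computed in parallel
squares <- future_lapply(1:4, function(i) i^2)
unlist(squares)  # 1 4 9 16
```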

The future package also extends the foreach package thanks to the updated doFuture package. By using `registerDoFuture` as the foreach backend, your loops can use any computation plan provided by the future package to run the iterations in parallel. (The same applies to R packages that use foreach internally, notably the caret package.) This means you can now use foreach with any of the HPC schedulers supported by future, which includes TORQUE, Slurm, and OpenLava. So if you share a Slurm HPC cluster with colleagues in your department, you can queue up a parallel simulation on the cluster using code like this:

```
library("doFuture")
registerDoFuture()
library("future.batchtools")
plan(batchtools_slurm)
mu <- 1.0
sigma <- 2.0
x <- foreach(i = 1:3, .export = c("mu", "sigma")) %dopar% {
  rnorm(i, mean = mu, sd = sigma)
}
```

The future package is available on CRAN now, and works consistently on Windows, Mac and Linux systems. You can learn more in the video at the end of this post, or in the recent blog update linked below.

To **leave a comment** for the author, please follow the link and comment on their blog: **Revolutions**.


(This article was first published on **R - Data Science Heroes Blog**, and kindly contributed to R-bloggers)

This is a post about feature selection using genetic algorithms in R, in which we will do a quick review about:

- What are genetic algorithms?
- GA in ML?
- What does a solution look like?
- GA process and its operators
- The fitness function
- Genetics Algorithms in R!
- Try it yourself
- Relating concepts

*Animation source: "Flexible Muscle-Based Locomotion for Bipedal Creatures" – Thomas Geijtenbeek*

Imagine a black box which can help us to decide over an **unlimited number of possibilities**, with a criterion such that we can find an acceptable solution (both in time and quality) to a problem that we formulate.

Genetic Algorithms (GA) are a mathematical model inspired by Charles Darwin's famous idea of *natural selection*.

Natural selection preserves only the fittest individuals over the different generations.

Imagine a population of 100 rabbits in 1900; if we looked at the population today, we would see rabbits that are faster and more skillful at finding food than their ancestors.

In **machine learning**, one of the uses of genetic algorithms is to pick the right set of variables with which to create a predictive model.

Picking the right subset of variables is a problem of **combinatorics and optimization**.

The advantage of this technique over others is that it allows the best solution to emerge from the best of the prior solutions: an evolutionary algorithm that improves the selection over time.

The idea of GA is to combine the different solutions **generation after generation** to extract the best *genes* (variables) from each one. That way it creates new and more fitted individuals.

We can find other uses of GA, such as hyperparameter tuning, finding the maximum (or minimum) of a function, or searching for the right neural network architecture (neuroevolution), among others…

Every possible solution of the GA, i.e. the set of selected variables, is **considered as a whole**: the algorithm will not rank variables individually against the target.

And this is important, because we already know that variables work in groups.

Keeping it simple for the example, imagine we have a total of 6 variables,

One solution can be picking up 3 variables, let's say: `var2`, `var4` and `var5`.

Another solution can be: `var1` and `var5`.

These solutions are the so-called **individuals** or **chromosomes** in a population. They are possible solutions to our problem.

*Credit image: Vijini Mallawaarachchi*

From the image, solution 3 can be expressed as a one-hot vector: `c(1,0,1,0,1,1)`. Each `1` indicates that the solution contains that variable: in this case `var1`, `var3`, `var5` and `var6`.

While solution 4 is: `c(1,1,0,1,1,0)`.

Each position in the vector is a **gene**.
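In R, this encoding is just a binary vector indexing into the variable names (a tiny illustration using the example above):

```r
# The six variables from the example and solution 3's chromosome
vars <- c("var1", "var2", "var3", "var4", "var5", "var6")
solution3 <- c(1, 0, 1, 0, 1, 1)

# Recover the selected variable names from the chromosome
vars[solution3 == 1]  # "var1" "var3" "var5" "var6"
```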

The underlying idea of a GA is to generate some random possible solutions (called the `population`), which represent different variable subsets, and then combine the best solutions in an iterative process.

This combination follows the basic GA operations, which are: selection, mutation and cross-over.

- **Selection**: Pick the most fitted individuals in a generation (i.e. the solutions providing the highest ROC).
- **Cross-over**: Create 2 new individuals based on the genes of two solutions. These children will appear in the next generation.
- **Mutation**: Change a gene randomly in an individual (i.e. flip a `0` to a `1`).
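The mutation and cross-over operators can be sketched in a few lines of base R (a simplified illustration, not the `GA` package's internals; the package's `gabin_uCrossover` implements the uniform cross-over idea shown here):

```r
set.seed(1)

# Mutation: flip exactly one random gene (0 -> 1 or 1 -> 0)
mutate <- function(chrom) {
  i <- sample(length(chrom), 1)
  chrom[i] <- 1 - chrom[i]
  chrom
}

# Uniform cross-over: each gene of the child comes from either parent at random
crossover <- function(p1, p2) {
  from_p1 <- runif(length(p1)) < 0.5
  ifelse(from_p1, p1, p2)
}

child <- crossover(c(1, 0, 1, 0, 1, 1), c(1, 1, 0, 1, 1, 0))
child  # every gene comes from one of the two parents
```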

The idea is that with each generation we will find better individuals, like a faster rabbit.

I recommend the post of Vijini Mallawaarachchi about how a genetic algorithm works.

These basic operations allow the algorithm to change the possible solutions by combining them in a way that maximizes the objective.

This objective maximization is, for example, keeping the solution that maximizes the area under the ROC curve. This is defined in the *fitness function*.

The fitness function takes a possible solution (or chromosome, if you want to sound more sophisticated), and *somehow* evaluates the effectiveness of the selection.

Normally, the fitness function takes the one-hot vector `c(1,1,0,0,0,0)`, creates, for example, a random forest model with `var1` and `var2`, and returns the fitness value (ROC).

The fitness value calculated in this code is: `ROC value / number of variables`. By doing this, the algorithm penalizes solutions with a large number of variables, similar in spirit to the Akaike information criterion, or AIC.
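The penalization itself is simple division (a stand-in sketch for the repository's `custom_fitness`; here the ROC value is a placeholder number, not the output of a fitted model):

```r
# vars is the one-hot chromosome; roc_value would come from a model
# trained on the selected variables (placeholder here).
penalized_fitness <- function(roc_value, vars) {
  roc_value / sum(vars)  # divide ROC by the number of selected variables
}

penalized_fitness(0.95, c(1, 1, 0, 0, 0, 0))  # 0.475: two variables selected
```

A solution reaching the same ROC with fewer variables therefore scores higher, which is exactly the pressure toward parsimony described above.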

My intention is to provide you with clean code so you can understand what's behind it, while at the same time trying new approaches, like modifying the fitness function. This is a crucial point.

To use it on your own data set, make sure `data_x` (data frame) and `data_y` (factor) are compatible with the `custom_fitness` function.

The main library is `GA`. See the vignette with examples here.

**Important**: The following code is incomplete. **Clone the repository** to run the example.

```
# data_x: input data frame
# data_y: target variable (factor)

# GA parameters
param_nBits <- ncol(data_x)
col_names <- colnames(data_x)

# Executing the GA
ga_GA_1 <- ga(fitness = function(vars) custom_fitness(vars = vars,
                                                      data_x = data_x,
                                                      data_y = data_y,
                                                      p_sampling = 0.7), # custom fitness function
              type = "binary",              # optimization data type
              crossover = gabin_uCrossover, # cross-over method
              elitism = 0.1,                # elitism
              pmutation = 0.03,             # mutation rate prob
              popSize = 50,                 # the number of individuals/solutions
              nBits = param_nBits,          # total number of variables
              names = col_names,            # variable names
              run = 5,                      # max iter without improvement (stopping criterion)
              maxiter = 50,                 # total runs or generations
              monitor = plot,               # plot the result at each iteration
              keepBest = TRUE,              # keep the best solution at the end
              parallel = TRUE,              # allow parallel processing
              seed = 84211                  # for reproducibility purposes
)
```

```
# Checking the results
summary(ga_GA_1)
```

```
── Genetic Algorithm ───────────────────
GA settings:
Type = binary
Population size = 50
Number of generations = 50
Elitism = 0.1
Crossover probability = 0.8
Mutation probability = 0.03
GA results:
Iterations = 15
Fitness function value = 0.1648982
Solution =
radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean
[1,] 0 0 0 0 0 1
concavity_mean concave points_mean symmetry_mean fractal_dimension_mean ...
[1,] 0 1 0 0
symmetry_worst fractal_dimension_worst
[1,] 0 0
```

```
# Following line will return the variable names of the final and best solution
best_vars_ga=col_names[ga_GA_1@solution[1,]==1]
# Checking the variables of the best solution...
best_vars_ga
```

```
[1] "compactness_mean" "concave points_mean" "symmetry_se" "radius_worst"
[5] "compactness_worst" "concavity_worst"
```

- Blue dot: Population fitness average
- Green dot: Best fitness value

Note: don't expect the result that fast in a real run.

Now we calculate the accuracy based on the best selection!

```
get_accuracy_metric(data_tr_sample = data_x, target = data_y, best_vars_ga)
```

```
[1] 0.9402818
```

The accuracy is around 94%, while the ROC value is close to 0.95 (ROC = fitness value * number of variables; check the fitness function).

I don't like to analyze accuracy without considering the cutpoint (Scoring Data), but it's useful for comparing with the results of this Kaggle post, whose author got a similar accuracy using recursive feature elimination, or RFE.

Try a new fitness function: some solutions still contain a large number of variables, so you could try penalizing by the square of the number of variables.

Another thing to try is changing the algorithm used to get the ROC value, or even changing the metric.

Some configurations take a long time to run. Balance the classes before modeling and play with the `p_sampling` parameter. Sampling techniques can have a big impact on models; check the Sample size and class balance on model performance post for more info.

How about changing the rate of mutation or elitism? Or trying other cross-over methods?

Increase the `popSize` to test more possible solutions at the same time (at a time cost).

Feel free to share any insights or ideas to improve the selection.

**Clone the repository** to run the example.

There is a parallel between GA and Deep Learning: the concept of iteration and improvement over time is similar.

I added the `p_sampling` parameter to speed things up, and it usually accomplishes its goal. It is similar to the *batch* concept used in Deep Learning. Another parallel is between the GA parameter `run` and the *early stopping* criterion in neural network training.

But the biggest similarity is that both techniques come from **observing nature**. In both cases, humans observed how neural networks and genetics work and created simplified mathematical models that imitate their behavior. Nature has had millions of years of evolution; why not try to imitate it?

—

I tried to be brief about GA, but if you have any specific question on this vast topic, please leave it in the comments.

*And if I didn't motivate you enough to study GA, check this project, which is based on neuroevolution:*

Thanks for reading

Find me on Twitter and Linkedin.

More blog posts.

Want to learn more? Data Science Live Book

To **leave a comment** for the author, please follow the link and comment on their blog: **R - Data Science Heroes Blog**.


(This article was first published on **R – intobioinformatics**, and kindly contributed to R-bloggers)

Clusterlab is a CRAN package (https://cran.r-project.org/web/packages/clusterlab/index.html) for the routine testing of clustering algorithms. It can simulate positive controls (data-sets with >1 clusters) and negative controls (data-sets with 1 cluster). Why test clustering algorithms? Because they often fail to identify the true K in practice, published algorithms are not always well tested, and we need to know about the ones that behave strangely. I've found in many of my own experiments that the clustering algorithms many people are using are not necessarily the ones that provide the most sensible results. I can give a good example below.

I was interested to see that clusterExperiment, a relatively new clustering package on Bioconductor, cannot detect the ground-truth K in my testing so far, instead yielding solutions with many more clusters than there are in reality. On the other hand, the package I developed with David Watson here at QMUL does work rather well. It is a derivative of the Monti et al. (2003) consensus clustering algorithm, fitted with a Monte Carlo reference procedure to eliminate overfitting. We called this algorithm M3C.

```
library(clusterExperiment)
library(clusterlab)
library(M3C)
```

**Experiment 1: Simulate a positive control dataset (K=5)**

```
k5 <- clusterlab(centers=5)
```

**Experiment 2: Test RSEC (https://bioconductor.org/packages/release/bioc/html/clusterExperiment.html)**

```
rsec_test <- RSEC(as.matrix(k5$synthetic_data), ncores=10)
assignments <- primaryCluster(rsec_test)
pca(k5$synthetic_data, labels=assignments)
```

**Experiment 3: Test M3C (https://bioconductor.org/packages/release/bioc/html/M3C.html)**

```
M3C_test <- M3C(k5$synthetic_data, iters=10, cores=10)
optk <- which.max(M3C_test$scores$RCSI) + 1
pca(M3C_test$realdataresults[[optk]]$ordered_data,
    labels=M3C_test$realdataresults[[optk]]$ordered_annotation$consensuscluster,
    printres=TRUE)
```

Interesting, isn't it, R readers?

Well, all I can say is that I recommend comparing different machine learning methods and, if in doubt, running your own control data-sets (positive and negative controls) to test the various methods. In the other post we showed a remarkable bias in optimism-corrected bootstrapping compared with LOOCV under certain conditions, simply by simulating null data-sets and passing them into the method.

**If you are clustering omic’ data from a single platform (RNAseq, protein, methylation, etc) I have tested and recommend the following packages:**

CLEST: https://www.rdocumentation.org/packages/RSKC/versions/2.4.2/topics/Clest

M3C: https://bioconductor.org/packages/release/bioc/html/M3C.html

And to be honest, that is about it. I have also tested PINSplus, the GAP statistic, and SNF, but they did not provide satisfactory results in my experiments on single-platform clustering (currently unpublished data). Multi-omic and single-cell RNA-seq analysis is another story; there will be more on that to come in the future, R readers.

Remember there is a darkness out there R readers, not just in Washington, but in the world of statistics. It is there because of the positive results bias in science, because of people not checking methods and comparing them with one another, and because of several other reasons I can’t even be bothered to mention.

To **leave a comment** for the author, please follow the link and comment on their blog: **R – intobioinformatics**.
