## Learning Data Science: Predicting Income Brackets

As promised in the post Learning Data Science: Modelling Basics we will now go a step further and try to predict income brackets with real world data and different modelling approaches. We will learn a thing or two along the way, e.g. about the so-called Accuracy-Interpretability Trade-Off, so read on…

The data we will use is from here: Marketing data set. The description reads:

This dataset contains questions from questionnaires that were filled out by shopping mall customers in the San Francisco Bay area. The goal is to predict the Annual Income of Household from the other 13 demographics attributes.

The following extra information (or metadata) is provided with the data:

cat(readLines('data/marketing-names.txt'), sep = '\n')
Marketing data set
1: Description.
This dataset contains questions from questionaries that were filled out by shopping mall customers in the San Francisco Bay area. The goal is to predict the Anual Income of Household from the other 13 demographics  attributes.
2: Type.            Classification
3: Origin.          Real world
4: Instances.       6876 (8993)
5: Features.        14
6: Classes          9
7: Missing values.  Yes
@relation marketing
@attribute Sex integer [1, 2]
@attribute MaritalStatus integer [1, 5]
@attribute Age integer [1, 7]
@attribute Education integer [1, 6]
@attribute Occupation integer [1, 9]
@attribute YearsInSf integer [1, 5]
@attribute DualIncome integer [1, 3]
@attribute HouseholdMembers integer [1, 9]
@attribute Under18 integer [0, 9]
@attribute HouseholdStatus integer [1, 3]
@attribute TypeOfHome integer [1, 5]
@attribute EthnicClass integer [1, 8]
@attribute Language integer [1, 3]
@attribute Income {1, 2, 3, 4, 5, 6, 7, 8, 9}
@inputs Sex, MaritalStatus, Age, Education, Occupation, YearsInSf, DualIncome, HouseholdMembers, Under18, HouseholdStatus, TypeOfHome, EthnicClass, Language
@outputs Income
DATA DICTIONARY
1    ANNUAL INCOME OF HOUSEHOLD (PERSONAL INCOME IF SINGLE)
1. Less than $10,000 2.$10,000 to $14,999 3.$15,000 to $19,999 4.$20,000 to $24,999 5.$25,000 to $29,999 6.$30,000 to $39,999 7.$40,000 to $49,999 8.$50,000 to $74,999 9.$75,000 or more

2    SEX
1. Male
2. Female
3    MARITAL STATUS
1. Married
2. Living together, not married
3. Divorced or separated
4. Widowed
5. Single, never married
4    AGE
1. 14 thru 17
2. 18 thru 24
3. 25 thru 34
4. 35 thru 44
5. 45 thru 54
6. 55 thru 64
7. 65 and Over
5    EDUCATION
4. 1 to 3 years of college
6    OCCUPATION
1. Professional/Managerial
2. Sales Worker
3. Factory Worker/Laborer/Driver
4. Clerical/Service Worker
5. Homemaker
6. Student, HS or College
7. Military
8. Retired
9. Unemployed
7    HOW LONG HAVE YOU LIVED IN THE SAN FRAN./OAKLAND/SAN JOSE AREA?
1. Less than one year
2. One to three years
3. Four to six years
4. Seven to ten years
5. More than ten years
8    DUAL INCOMES (IF MARRIED)
1. Not Married
2. Yes
3. No
1. One
2. Two
3. Three
4. Four
5. Five
6. Six
7. Seven
8. Eight
9. Nine or more

10    PERSONS IN HOUSEHOLD UNDER 18
0. None
1. One
2. Two
3. Three
4. Four
5. Five
6. Six
7. Seven
8. Eight
9. Nine or more
11    HOUSEHOLDER STATUS
1. Own
2. Rent
3. Live with Parents/Family
12    TYPE OF HOME
1. House
2. Condominium
3. Apartment
4. Mobile Home
5. Other
13    ETHNIC CLASSIFICATION
1. American Indian
2. Asian
3. Black
4. East Indian
5. Hispanic
6. Pacific Islander
7. White
8. Other
14    WHAT LANGUAGE IS SPOKEN MOST OFTEN IN YOUR HOME?
1. English
2. Spanish
3. Other


Our task is to predict the variable “Income”.

So, let us first load the data, have a look at some of its characteristics and perform a little bit of formatting. After that we divide it into a training (80%) and a test set (20%) to account for potential overfitting (also see Learning Data Science: Modelling Basics):

data <- read.csv("data/marketing.dat")
dim(data)
## [1] 6876   14

str(data)
## 'data.frame':    6876 obs. of  14 variables:
##  $Sex : int 1 2 2 2 1 1 1 1 1 1 ... ##$ MaritalStatus   : int  1 1 5 5 1 5 3 1 1 5 ...
##  $Age : int 5 3 1 1 6 2 3 6 7 2 ... ##$ Education       : int  5 5 2 2 4 3 4 3 4 4 ...
##  $Occupation : int 5 1 6 6 8 9 3 8 8 9 ... ##$ YearsInSf       : int  5 5 5 3 5 4 5 5 4 5 ...
##  $DualIncome : int 3 2 1 1 3 1 1 3 3 1 ... ##$ HouseholdMembers: int  5 3 4 4 2 3 1 3 2 1 ...
##  $Under18 : int 2 1 2 2 0 1 0 0 0 0 ... ##$ HouseholdStatus : int  1 2 3 3 1 2 2 2 2 2 ...
##  $TypeOfHome : int 1 3 1 1 1 3 3 3 3 3 ... ##$ EthnicClass     : int  7 7 7 7 7 7 7 7 7 7 ...
##  $Language : int 1 1 1 1 1 1 1 1 1 1 ... ##$ Income          : int  9 9 1 1 8 1 6 2 4 1 ...

data_names <- names(data)
data <- cbind(data[-ncol(data)], factor(data$Income)) # make variable Income (which should be predicted) a factor names(data) <- data_names set.seed(12) random <- sample(1:nrow(data), 0.8 * nrow(data)) data_train <- data[random, ] data_test <- data[-random, ]  We start with a simple but comprehensible model, OneR (on CRAN), as a benchmark: library(OneR) data <- optbin(data_train) model <- OneR(data, verbose = TRUE) ## ## Attribute Accuracy ## 1 * Age 28.2% ## 2 MaritalStatus 28.11% ## 3 Occupation 28.07% ## 4 HouseholdStatus 27.56% ## 5 DualIncome 27.04% ## 6 Education 25.98% ## 7 HouseholdMembers 22.51% ## 8 Under18 20.69% ## 9 TypeOfHome 19.36% ## 10 EthnicClass 19.29% ## 11 Sex 18.07% ## 12 Language 17.82% ## 13 YearsInSf 17.75% ## --- ## Chosen attribute due to accuracy ## and ties method (if applicable): '*' summary(model) ## ## Call: ## OneR.data.frame(x = data, verbose = TRUE) ## ## Rules: ## If Age = 1 then Income = 1 ## If Age = 2 then Income = 1 ## If Age = 3 then Income = 6 ## If Age = 4 then Income = 8 ## If Age = 5 then Income = 8 ## If Age = 6 then Income = 8 ## If Age = 7 then Income = 6 ## ## Accuracy: ## 1551 of 5500 instances classified correctly (28.2%) ## ## Contingency table: ## Age ## Income 1 2 3 4 5 6 7 Sum ## 1 * 421 * 352 99 43 21 15 25 976 ## 2 16 204 107 39 13 22 33 434 ## 3 9 147 122 49 12 21 35 395 ## 4 5 121 188 71 39 29 42 495 ## 5 3 77 179 81 29 23 34 426 ## 6 10 93 * 234 156 70 56 * 47 666 ## 7 12 92 185 155 107 66 33 650 ## 8 12 111 211 * 251 * 160 * 86 44 875 ## 9 11 76 114 187 104 69 22 583 ## Sum 499 1273 1439 1032 555 387 315 5500 ## --- ## Maximum in each column: '*' ## ## Pearson's Chi-squared test: ## X-squared = 2671.2, df = 48, p-value < 2.2e-16 plot(model)  prediction <- predict(model, data_test) eval_model(prediction, data_test) ## ## Confusion matrix (absolute): ## Actual ## Prediction 1 2 3 4 5 6 7 8 9 Sum ## 1 232 45 46 32 33 27 19 27 24 485 ## 2 0 0 0 0 0 0 0 0 0 0 ## 3 0 0 0 0 0 0 0 0 0 0 ## 4 0 0 0 0 0 0 0 0 0 0 ## 5 0 0 0 0 0 0 0 0 0 0 ## 6 31 30 44 44 41 66 44 57 50 407 ## 7 0 0 0 0 0 0 0 0 0 0 ## 8 16 20 20 47 27 87 71 110 86 484 ## 9 0 0 0 0 0 0 0 0 0 0 ## Sum 279 95 110 123 101 180 134 194 160 1376 ## ## Confusion matrix (relative): ## Actual ## Prediction 1 2 3 4 5 6 7 8 9 Sum ## 1 0.17 0.03 0.03 0.02 0.02 0.02 0.01 0.02 0.02 0.35 ## 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ## 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ## 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ## 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ## 6 0.02 0.02 0.03 0.03 0.03 0.05 0.03 0.04 0.04 0.30 ## 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ## 8 0.01 0.01 0.01 0.03 0.02 0.06 0.05 0.08 0.06 0.35 ## 9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ## Sum 0.20 0.07 0.08 0.09 0.07 0.13 0.10 0.14 0.12 1.00 ## ## Accuracy: ## 0.2965 (408/1376) ## ## Error rate: ## 0.7035 (968/1376) ## ## Error rate reduction (vs. base rate): ## 0.1176 (p-value < 2.2e-16)  What can we conclude from this? First the most important feature is “Age” while “Marital Status”, “Occupation” and “Household Status” perform comparably well. The overall accuracy (i.e. the percentage of correctly predicted instances) is with about 30% not that great, on the other hand not that extraordinarily uncommon when dealing with real-world data. Looking at the model itself, in the form of rules and the diagnostic plot, we can see that we have non-linear relationship between Age and Income: the older one gets the higher the income bracket, except for people who are old enough to retire. This seems plausible. OneR is basically a one step decision tree, so the natural choice for our next analysis would be to have a full decision tree built (all packages are on CRAN): library(rpart) library(rpart.plot) model <- rpart(Income ~., data = data_train) rpart.plot(model, type = 3, extra = 0, box.palette = "Grays")  prediction <- predict(model, data_test, type = "class") eval_model(prediction, data_test) ## ## Confusion matrix (absolute): ## Actual ## Prediction 1 2 3 4 5 6 7 8 9 Sum ## 1 201 36 22 13 16 12 8 15 12 335 ## 2 43 25 32 22 17 12 10 14 6 181 ## 3 0 0 0 0 0 0 0 0 0 0 ## 4 0 0 0 0 0 0 0 0 0 0 ## 5 0 0 0 0 0 0 0 0 0 0 ## 6 18 24 40 50 42 68 32 33 22 329 ## 7 0 0 0 0 0 0 0 0 0 0 ## 8 17 10 16 38 26 88 84 132 120 531 ## 9 0 0 0 0 0 0 0 0 0 0 ## Sum 279 95 110 123 101 180 134 194 160 1376 ## ## Confusion matrix (relative): ## Actual ## Prediction 1 2 3 4 5 6 7 8 9 Sum ## 1 0.15 0.03 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.24 ## 2 0.03 0.02 0.02 0.02 0.01 0.01 0.01 0.01 0.00 0.13 ## 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ## 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ## 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ## 6 0.01 0.02 0.03 0.04 0.03 0.05 0.02 0.02 0.02 0.24 ## 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ## 8 0.01 0.01 0.01 0.03 0.02 0.06 0.06 0.10 0.09 0.39 ## 9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ## Sum 0.20 0.07 0.08 0.09 0.07 0.13 0.10 0.14 0.12 1.00 ## ## Accuracy: ## 0.3096 (426/1376) ## ## Error rate: ## 0.6904 (950/1376) ## ## Error rate reduction (vs. base rate): ## 0.134 (p-value < 2.2e-16)  The new model has an accuracy that is a little bit better but the interpretability is a little bit worse. You have to go through the different branches to see in which income bracket it ends. So for example when the age bracket is below (which means that it is ) it predicts income bracket , when it is bigger than and the Household Status bracket is it predicts income income bracket and so on. The most important variable, as you can see is again Age (which is reassuring that OneR was on the right track) but we also see Household Status and Occupation again. What is better than one tree? Many trees! So the next natural step is to have many trees built, while varying the features and the examples that should be included in each tree. At the end it may be that different trees give different income brackets as their respective prediction, which we solve via voting as in a good democracy. This whole method is fittingly called Random Forests and fortunately there are many good packages available in R. We use the randomForest package (also on CRAN) here: library(randomForest) ## randomForest 4.6-14 ## Type rfNews() to see new features/changes/bug fixes. set.seed(4543) # for reproducibility model <- randomForest(Income ~., data = data_train, importance = TRUE) varImpPlot(model)  prediction <- predict(model, data_test) eval_model(prediction, data_test) ## ## Confusion matrix (absolute): ## Actual ## Prediction 1 2 3 4 5 6 7 8 9 Sum ## 1 223 35 26 16 19 11 9 18 16 373 ## 2 24 15 12 20 7 11 1 4 1 95 ## 3 9 10 11 14 9 6 3 4 2 68 ## 4 5 15 25 22 10 22 6 9 5 119 ## 5 2 2 8 9 6 12 6 3 1 49 ## 6 3 5 15 14 19 40 23 17 15 151 ## 7 8 4 7 13 14 26 25 24 5 126 ## 8 3 8 5 11 13 44 49 87 68 288 ## 9 2 1 1 4 4 8 12 28 47 107 ## Sum 279 95 110 123 101 180 134 194 160 1376 ## ## Confusion matrix (relative): ## Actual ## Prediction 1 2 3 4 5 6 7 8 9 Sum ## 1 0.16 0.03 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.27 ## 2 0.02 0.01 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.07 ## 3 0.01 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.05 ## 4 0.00 0.01 0.02 0.02 0.01 0.02 0.00 0.01 0.00 0.09 ## 5 0.00 0.00 0.01 0.01 0.00 0.01 0.00 0.00 0.00 0.04 ## 6 0.00 0.00 0.01 0.01 0.01 0.03 0.02 0.01 0.01 0.11 ## 7 0.01 0.00 0.01 0.01 0.01 0.02 0.02 0.02 0.00 0.09 ## 8 0.00 0.01 0.00 0.01 0.01 0.03 0.04 0.06 0.05 0.21 ## 9 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.02 0.03 0.08 ## Sum 0.20 0.07 0.08 0.09 0.07 0.13 0.10 0.14 0.12 1.00 ## ## Accuracy: ## 0.3459 (476/1376) ## ## Error rate: ## 0.6541 (900/1376) ## ## Error rate reduction (vs. base rate): ## 0.1796 (p-value < 2.2e-16)  As an aside you can see that the basic modelling workflow stayed the same – no matter what approach (OneR, decision tree or random forest) you chose. This standard is kept for most (modern) packages and one of the great strengths of R! With thousands of packages on CRAN alone there are of course sometimes differences but those are normally well documented (so the help function or the vignette are your friends!) Now, back to the result of the last analysis: again, the overall accuracy is better but because we have hundreds of trees now the interpretability suffered a lot. This is also known under the name Accuracy-Interpretability Trade-Off. The best we can do in the case of random forests is to find out which features are the most important ones. Again Age, Occupation and Household Status show up, which is consistent with our analyses so far. Additionally, because of the many trees that had to be built, this analysis ran a lot longer than the other two. Random forests are one of the best methods out of the box, so the accuracy of about 34% tells us that our first model (OneR) wasn’t doing too bad in the first place! Why are able to reach comparatively good results with just one feature. This is true for many real-world data sets. Sometimes simple models are not much worse than very complicated ones – you should keep that in mind! If you play around with this dataset I would be interested in your results! Please post them in the comments – Thank you and stay tuned! ## Learning R: The Collatz Conjecture In this post we will see that a little bit of simple R code can go a very long way! So let’s get started! One of the fascinating features of number theory (unlike many other branches of mathematics) is that many statements are easy to make but the brightest minds are not able to prove them, the so called Collatz conjecture (named after the German mathematician Lothar Collatz) is an especially fascinating example: The Collatz conjecture states that when you start with any positive integer and • if it is even, the next number is one half the previous number and, • if it is odd, the next number is three times the previous number plus one • the sequence will always reach one. It doesn’t get any simpler than that but no one has been able to prove this – and not for a lack of trying! The great mathematician Paul Erdős said about it “Mathematics may not be ready for such problems.” You can read more on Wikipedia: Collatz conjecture and an especially nice film that was made by a group of students can be watched here: The Collatz Conjecture. So let us write a little program and try some numbers! First we need a simple helper function to determine whether a number is even: is.even <- function(x) { if (x %% 2 == 0) TRUE else FALSE } is.even(2) ## [1] TRUE is.even(3) ## [1] FALSE  Normally we wouldn’t use a dot within function names but R itself (because of its legacy code) is not totally consistent here and the is-function family (like is.na or is.integer) all use a dot. After that we write a function for the rule itself, making use of the is.even function: collatz <- function(n) { if (is.even(n)) n/2 else 3 * n + 1 } collatz(6) ## [1] 3 collatz(5) ## [1] 16  To try a number and plot it (like in the Wikipedia article) we could use a while-loop: n_total <- n <- 27 while (n != 1) { n <- collatz(n) n_total <- c(n_total, n) } n_total ## [1] 27 82 41 124 62 31 94 47 142 71 214 107 322 161 ## [15] 484 242 121 364 182 91 274 137 412 206 103 310 155 466 ## [29] 233 700 350 175 526 263 790 395 1186 593 1780 890 445 1336 ## [43] 668 334 167 502 251 754 377 1132 566 283 850 425 1276 638 ## [57] 319 958 479 1438 719 2158 1079 3238 1619 4858 2429 7288 3644 1822 ## [71] 911 2734 1367 4102 2051 6154 3077 9232 4616 2308 1154 577 1732 866 ## [85] 433 1300 650 325 976 488 244 122 61 184 92 46 23 70 ## [99] 35 106 53 160 80 40 20 10 5 16 8 4 2 1 plot(n_total, type = "l", col = "blue", xlab = "", ylab = "")  As you can see, after a wild ride the sequence finally reaches one as expected. We end with some nerd humour from the cult website xkcd: ## Evolution works! Hamlet: Do you see yonder cloud that’s almost in shape of a camel? Polonius: By the mass, and ’tis like a camel, indeed. Hamlet: Methinks it is like a weasel. from Hamlet by William Shakespeare The best way to see how evolution works, is to watch it in action! You can watch the evolution of cars live in this application (but be careful, it’s addictive): BoxCar 2D It is fascinating to see how those cars get better and better over time, sometimes finding very impressive solutions: To understand how evolution works even better, let us create an artificial evolution in R! The famous evolutionary biologist Richard Dawkins gave in his book “The Blind Watchmaker” the following thought experiment: I don’t know who it was first pointed out that, given enough time, a monkey bashing away at random on a typewriter could produce all the works of Shakespeare. The operative phrase is, of course, given enough time. Let us limit the task facing our monkey somewhat. Suppose that he has to produce, not the complete works of Shakespeare but just the short sentence ‘Methinks it is like a weasel’, and we shall make it relatively easy by giving him a typewriter with a restricted keyboard, one with just the 26 (capital) letters, and a space bar. How long will he take to write this one little sentence? We are now going to put this idea into practice! The following outline is from the Wikipedia article on the weasel program (Weasel program): 1. Start with a random string of 28 characters. 2. Make 100 copies of the string (reproduce). 3. For each character in each of the 100 copies, with a probability of 5%, replace (mutate) the character with a new random character. 4. Compare each new string with the target string “METHINKS IT IS LIKE A WEASEL”, and give each a score (the number of letters in the string that are correct and in the correct position). 5. If any of the new strings has a perfect score (28), halt. Otherwise, take the highest scoring string, and go to step 2. So let us first define some variables and helper functions for reproduction, mutation and fitness calculation: target <- unlist(strsplit("METHINKS IT IS LIKE A WEASEL" , "")) # assign target string to "target" pop_sz <- 100 # assign population size 100 to "pop_sz" mt_rt <- 0.05 # assign mutation rate 5% to "mt_rt" reproduce <- function(string) { # input: vector "string" # output: matrix with "pop_sz" columns, where each column is vector "string" matrix(string, nrow = length(string), ncol = pop_sz) } mutate <- function(pop) { # input: matrix of population "pop" # output: matrix of population where each character, with a probability of mt_rt per cent (= 5%), is replaced with a new random character mt_pos <- runif(length(pop)) <= mt_rt pop[mt_pos] <- sample(c(LETTERS, " "), sum(mt_pos), replace = TRUE) pop } fitness <- function(pop) { # input: matrix of population "pop" # output: vector of the number of letters that are correct (= equal to target) for each column colSums(pop == target) }  After that we are going through all five steps listed above: # 1. Start with a random string of 28 characters. set.seed(70) start <- sample(c(LETTERS, " "), length(target), replace = TRUE) # 2. Make 100 copies of this string (reproduce). pop <- reproduce(start) # 3. For each character in each of the 100 copies, with a probability of 5%, replace (mutate) the character with a new random character. pop <- mutate(pop) # 4. Compare each new string with the target "METHINKS IT IS LIKE A WEASEL", and give each a score (the number of letters in the string that are correct and in the correct position). score <- fitness(pop) # 5. If any of the new strings has a perfect score (28), halt. Otherwise, take the highest scoring string, and go to step 2. highscorer <- pop[ , which.max(score)] # assign string to "highscorer" which has max. score in the population gen_no <- 1 #assign 1 to generation counter "gen_no" while (max(score) < length(target)) { cat("No. of generations: ", gen_no, ", best so far: ", highscorer, " with score: ", max(score), "\n", sep = "") pop <- reproduce(highscorer) # 2. select the highest scoring string for reproduction pop <- mutate(pop) # 3. mutation score <- fitness(pop) # 4. fitness calculation highscorer <- pop[ , which.max(score)] # assign string to "highscorer" which has max. score in the population gen_no <- gen_no + 1 # increment generation counter } ## No. of generations: 1, best so far: BZRDXXINEIMYQVJWBFZKFCVUPFYL with score: 2 ## No. of generations: 2, best so far: BZRDXNINEIMYQVJWBFZKFCVUPFYL with score: 3 ## No. of generations: 3, best so far: BZRDXNINEIMYQVJWBFZKACVEPFYR with score: 4 ## No. of generations: 4, best so far: BZRDININEIMYQBJWBFZKACVEPFYR with score: 5 ## No. of generations: 5, best so far: BZRDININEIMYIBJWBFZKACVEPFYR with score: 6 ## No. of generations: 6, best so far: BZRDININEIMYIBJLBFZKACVEPFYR with score: 7 ## No. of generations: 7, best so far: BRRDININEIMYIBJLOFZKACVEPFYL with score: 8 ## No. of generations: 8, best so far: BRRDININEIMYIZJLOFZKACVEAFYL with score: 9 ## No. of generations: 9, best so far: BRRDINKNEIMYIZJLOFZKAT EAFYL with score: 10 ## No. of generations: 10, best so far: BRRDINKNEIMYIZJLOFZKATVEASYL with score: 11 ## No. of generations: 11, best so far: BRRDINKNEIMYIZJLOFEKATVEASYL with score: 12 ## No. of generations: 12, best so far: BRRUINKNEIMYIZJLOFEKATVEASEL with score: 13 ## No. of generations: 13, best so far: BERUINKNEIMYIZJLOFEKATVEASEL with score: 14 ## No. of generations: 14, best so far: BERHINKNEIMYIZJLVFEKATVEASEL with score: 15 ## No. of generations: 15, best so far: BERHINKNEIMQIZJLVFE ATVEASEL with score: 16 ## No. of generations: 16, best so far: BERHINKNEIMQIZ LVFE ATVEASEL with score: 17 ## No. of generations: 17, best so far: BERHINKNEIMQIZ LVFE ATVEASEL with score: 17 ## No. of generations: 18, best so far: BERHINKNEIMQIZ LVFE ATVEASEL with score: 17 ## No. of generations: 19, best so far: TERHINKNEIMQIZ LVFE ATDEASEL with score: 17 ## No. of generations: 20, best so far: TERHINKNEIMQIZ LVFE ATDEASEL with score: 17 ## No. of generations: 21, best so far: TERHINKNJISQIZ LVFE ATDEASEL with score: 17 ## No. of generations: 22, best so far: TERHINKNJISQIZ LVFE A DEASEL with score: 18 ## No. of generations: 23, best so far: TERHINKNJISQIZ LVFE A DEASEL with score: 18 ## No. of generations: 24, best so far: TERHINKNJITQIZ LVFE A YEASEL with score: 19 ## No. of generations: 25, best so far: TERHINKNJITQIZ LPFE A YEASEL with score: 19 ## No. of generations: 26, best so far: TERHINKN ITQIZ LPFE A YEASEL with score: 20 ## No. of generations: 27, best so far: MERHINKN ITQIZ LPFE A YEASEL with score: 21 ## No. of generations: 28, best so far: MERHINKN IT IZ LPFE A YEASEL with score: 22 ## No. of generations: 29, best so far: MERHINKN IT IS LPFE A YEASEL with score: 23 ## No. of generations: 30, best so far: MERHINKN IT IS LPFE A YEASEL with score: 23 ## No. of generations: 31, best so far: MERHINKN IT IS LPFE A YEASEL with score: 23 ## No. of generations: 32, best so far: MERHINKN IT IS LAFE A WEASEL with score: 24 ## No. of generations: 33, best so far: METHINKN IT IS LAFE A WEASEL with score: 25 ## No. of generations: 34, best so far: METHINKN IT IS LAFE A WEASEL with score: 25 ## No. of generations: 35, best so far: METHINKN IT IS LAFE A WEASEL with score: 25 ## No. of generations: 36, best so far: METHINKN IT IS LAFE A WEASEL with score: 25 ## No. of generations: 37, best so far: METHINKN IT IS LAFE A WEASEL with score: 25 ## No. of generations: 38, best so far: METHINKU IT IS LIFE A WEASEL with score: 26 ## No. of generations: 39, best so far: METHINKU IT IS LIFE A WEASEL with score: 26 ## No. of generations: 40, best so far: METHINKU IT IS LIFE A WEASEL with score: 26 ## No. of generations: 41, best so far: METHINKU IT IS LIKE A WEASEL with score: 27 ## No. of generations: 42, best so far: METHINKU IT IS LIKE A WEASEL with score: 27 ## No. of generations: 43, best so far: METHINKU IT IS LIKE A WEASEL with score: 27 ## No. of generations: 44, best so far: METHINKU IT IS LIKE A WEASEL with score: 27 ## No. of generations: 45, best so far: METHINKU IT IS LIKE A WEASEL with score: 27 cat("Mission accomplished in ", gen_no, " generations: ", highscorer, sep = "") ## Mission accomplished in 46 generations: METHINKS IT IS LIKE A WEASEL  As you can see, the algorithm arrived at the target phrase pretty quickly. Now, you can try to tweak different parameter setting, like the population size or the mutation rate, and see what happens. You can of course also change the target phrase. A minority of (often very religious) people reject the fact of evolution because they miss a crucial step: selection based on fitness. Selection gives evolution direction towards solutions that are better able to solve a certain problem. It is the exact opposite of pure randomness which many people still suspect behind evolution. To see the difference the only thing we have to do is to comment out the line pop <- reproduce(highscorer) which selects the highest scoring string for reproduction. We can see that without selection there is no improvement to be seen and the algorithm would run “forever”: ## No. of generations: 1, best so far: UJGGZYOEDJMRADTQUXFWAVWPBGFX with score: 2 ## No. of generations: 2, best so far: UHGGZQOEDJERAD QBXFSBRWPBGFX with score: 2 ## No. of generations: 3, best so far: UNGDZYOEDSERADTQIXFSBVWPAGFX with score: 3 ## No. of generations: 4, best so far: UHGGZQNEDJERAG QBXFSBRWPBGWX with score: 2 ## No. of generations: 5, best so far: IDGGTJOELJERAETQBDFSBVWEBGFX with score: 2 ## No. of generations: 6, best so far: IDGGTJOELJERNETQBDFSBVWEBGFX with score: 2 ## No. of generations: 7, best so far: FNJGZYOESJERERTQGXGSBVWEBSFX with score: 3 ## No. of generations: 8, best so far: UJGWZBOERJMUAQTQUXFVAVWKKSFX with score: 3 ## No. of generations: 9, best so far: VETGRYOEYVVSAOTQBKOSTVPPGGFM with score: 3 ## No. of generations: 10, best so far: VETGRYOEYVVSAOTQBKOSTVPPGGFM with score: 3 ## No. of generations: 11, best so far: VETGRYOEYVVSAKTQBKOSTVPPGGFM with score: 3 ## No. of generations: 12, best so far: IETGRYOTYVVDAKTQBKOCTVPPGGFM with score: 3 ## No. of generations: 13, best so far: TTVVZOKDJERADELYXFKWGWXKGYO with score: 3 ## No. of generations: 14, best so far: UNGWCYOZDEWRAD WKXKSBVWECGFX with score: 3 ## No. of generations: 15, best so far: UNGWCYOZDEWRBD WKXKSBVWECGFX with score: 3 ## No. of generations: 16, best so far: UNGSCYOZDEWRBD WKXKSAVCECGFX with score: 3 ## No. of generations: 17, best so far: MXKGZYOMSJ RIOTQBLJSBVNPAGDL with score: 4 ## No. of generations: 18, best so far: MXKGZYOMSJ RIOTQBLJSBVNPAGDL with score: 4 ## No. of generations: 19, best so far: MXKGZYOMZJ RIOTQBLJSVVNPAGDL with score: 4 ## No. of generations: 20, best so far: TTVVJGKDDERADELYJXKRGWEKGYU with score: 4 ## No. of generations: 21, best so far: TTVVJGKDDERADELYDXBRGWEKGYU with score: 4 ## No. of generations: 22, best so far: TTWVJGKDQERADELYDXBRGWEKGYU with score: 4 ## No. of generations: 23, best so far: MXKGOYOMCJ RIOTQBLJIVVAPAJDG with score: 3 ## No. of generations: 24, best so far: MXKGOYOMCJ RIOTQBLJIVVAPAJDG with score: 3 ## No. of generations: 25, best so far: MXKGOYOMCJ RIOTQBLJIVVAPAJDG with score: 3 ## No. of generations: 26, best so far: MXKGOYOMCJ RIOTQBLJIVVAPAJDG with score: 3 ## No. of generations: 27, best so far: TNTUXYKJPJNDAITLAJTYBAWPMGGB with score: 4 ## No. of generations: 28, best so far: MXKGOYOMCJ RIOTLBLJIVVAPAJDX with score: 4 ## No. of generations: 29, best so far: MXKGOYOMCJ RIOTLBLJIVVAJAJDX with score: 4 ## No. of generations: 30, best so far: TUTUYYKNPJNDAITLAJTYBAAPMOGB with score: 3 ## No. of generations: 31, best so far: NGAFULYDZELWD QDPRSMPWYAPZH with score: 3 ## No. of generations: 32, best so far: HKUOZSJSXDERS TLBHASAVGPBEJT with score: 3 ## No. of generations: 33, best so far: NGAFULYDTELWD QDPRSMPWYAPZH with score: 3 ## No. of generations: 34, best so far: HKUYMSJAXDERS TLBHA AVGPBEJT with score: 3 ## No. of generations: 35, best so far: HKUYMSJAXDSRS TLBHA AVGPBEJT with score: 3 ## No. of generations: 36, best so far: HKXYMSJYXDSRS TLBHA AVGPNEJT with score: 3 ## No. of generations: 37, best so far: KNGABULYDTELWD QDORSFPWYAPZH with score: 3 ## No. of generations: 38, best so far: LLCIZN EOISJ DHFIEGPXNWYMYOX with score: 4 ## No. of generations: 39, best so far: LLCIZN EOISJ DHFIEXPXNWYMYOX with score: 4 ## No. of generations: 40, best so far: MZN KMIESQRRILELIIILFIGRYRZZ with score: 4 ## No. of generations: 41, best so far: ITQXZEKK SENLSCJXAKQ EKNCNUJ with score: 3 ## No. of generations: 42, best so far: MELBV VEUBRKXSNHWGILBU JVLZX with score: 3 ## No. of generations: 43, best so far: DZNAKMIEOQRRILELIVILKIGVYRZZ with score: 3 ## No. of generations: 44, best so far: DZNAKMIEOQRRILELIVILKIGVYRZZ with score: 3 ## No. of generations: 45, best so far: LRPDILXMGCWDD ZQD BKANWHMKFI with score: 3 ## No. of generations: 46, best so far: KEGAMRLYDAELDDUXLORSFPWOAPLH with score: 3 ## No. of generations: 47, best so far: KEGAMRLYDAELDDUXLORSFPWOAPLH with score: 3 ## No. of generations: 48, best so far: KEGAMRLYDAELDZUXLORHFPWOAPLH with score: 3 ## No. of generations: 49, best so far: KEGAMRLYDAEWDZUXLORHFPWOAPLH with score: 3 ## No. of generations: 50, best so far: KEGAMRLYDAEWDZDXLORHFPWOAPLH with score: 3  If this was how evolution really worked it wouldn’t work at all. Because evolution is a very powerful optimization method there are also real world applications of so called genetic algorithms (GA). In the following example we want to find the global optimum of the so called Rastrigin function. What makes this task especially difficult for this popular test problem is the large number of local minima, as can be seen when plotting the function: library(GA) ## Loading required package: foreach ## Loading required package: iterators ## Package 'GA' version 3.2 ## Type 'citation("GA")' for citing this R package in publications. ## ## Attaching package: 'GA' ## The following object is masked from 'package:utils': ## ## de Rastrigin <- function(x1, x2) { 20 + x1^2 + x2^2 - 10*(cos(2*pi*x1) + cos(2*pi*x2)) } x1 <- x2 <- seq(-5.12, 5.12, by = 0.1) f <- outer(x1, x2, Rastrigin) persp3D(x1, x2, f, theta = 50, phi = 20)  filled.contour(x1, x2, f, color.palette = bl2gr.colors)  To find the global minimum (spoiler: it is at ) we use the GA package (because GA only maximizes we use the minus sign in front of the fitness function): set.seed(70) GA <- ga(type = "real-valued", fitness = function(x) -Rastrigin(x[1], x[2]), lower = c(-5.12, -5.12), upper = c(5.12, 5.12), maxiter = 1000) summary(GA) ## -- Genetic Algorithm ------------------- ## ## GA settings: ## Type = real-valued ## Population size = 50 ## Number of generations = 1000 ## Elitism = 2 ## Crossover probability = 0.8 ## Mutation probability = 0.1 ## Search domain = ## x1 x2 ## lower -5.12 -5.12 ## upper 5.12 5.12 ## ## GA results: ## Iterations = 1000 ## Fitness function value = -3.630204e-07 ## Solution = ## x1 x2 ## [1,] 2.81408e-05 3.221658e-05 plot(GA)  filled.contour(x1, x2, f, color.palette = bl2gr.colors, plot.axes = { axis(1); axis(2); points(GA@solution[ , 1], GA@solution[ , 2], pch = 3, cex = 2, col = "white", lwd = 2) } )  Quite impressive, isn’t it! Evolution just works! In an upcoming post we will use evolutionary methods to find a nice functional form for some noisy data with a method called symbolic regression or genetic programming – so stay tuned! ## Customers who bought… One of the classic examples in data science (called data mining at the time) is the beer and diapers example: when a big supermarket chain started analyzing their sales data they encountered not only trivial patterns, like toothbrushes and toothpaste being bought together, but also quite strange combinations like beer and diapers. Now, the trivial ones are reassuring that the method works but what about the more extravagant ones? Does it mean that young parents are alcoholics? Or that instead of breastfeeding they give their babies beer? Obviously, they had to get to the bottom of this. As it turned out in many cases they following happened: stressed out mummy sends young daddy to the supermarket because they had run out of diapers. Young daddy seizes the opportunity to not only buy the much needed diapers but also to stock up on some beer! So what the supermarket did was to place the beer directly on the way from the aisle with the diapers – the result was a significant boost in beer sales (for all the young daddies who might have forgotten what they really wanted when buying diapers…). So, to reproduce this example in a simplified way have a look at the following code: # some example data for items bought together (market baskets) Diapers <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0) Baby_Oil <- c(1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0) Ham <- c(rep(0, 6), rep(1, 2), rep(0, 7)) Beer <- c(rep(0, 3), 1, rep(0, 11)) (basket <- cbind(Diapers, Baby_Oil, Ham, Beer)) ## Diapers Baby_Oil Ham Beer ## [1,] 1 1 0 0 ## [2,] 0 1 0 0 ## [3,] 1 1 0 0 ## [4,] 1 0 0 1 ## [5,] 0 0 0 0 ## [6,] 1 1 0 0 ## [7,] 1 0 1 0 ## [8,] 0 0 1 0 ## [9,] 1 1 0 0 ## [10,] 1 1 0 0 ## [11,] 0 1 0 0 ## [12,] 0 1 0 0 ## [13,] 1 1 0 0 ## [14,] 0 0 0 0 ## [15,] 0 0 0 0 # analysis of items bought together round(cor_basket <- cor(basket), 2) # cor is the core of the method! (no pun intended) ## Diapers Baby_Oil Ham Beer ## Diapers 1.00 0.33 -0.03 0.25 ## Baby_Oil 0.33 1.00 -0.48 -0.33 ## Ham -0.03 -0.48 1.00 -0.10 ## Beer 0.25 -0.33 -0.10 1.00 diag(cor_basket) <- 0 # we don't want to recommend the same products to the customers who already bought them round(cor_basket, 2) ## Diapers Baby_Oil Ham Beer ## Diapers 0.00 0.33 -0.03 0.25 ## Baby_Oil 0.33 0.00 -0.48 -0.33 ## Ham -0.03 -0.48 0.00 -0.10 ## Beer 0.25 -0.33 -0.10 0.00 # printing items bought together for (i in 1:ncol(cor_basket)) { col <- cor_basket[ , i, drop = FALSE] col <- col[order(col, decreasing = TRUE), , drop = FALSE] cat("Customers who bought", colnames(col), "also bought", rownames(col)[col > 0], "\n") } ## Customers who bought Diapers also bought Baby_Oil Beer ## Customers who bought Baby_Oil also bought Diapers ## Customers who bought Ham also bought ## Customers who bought Beer also bought Diapers  What we are looking for is some kind of dependance pattern within the shopping baskets, in this case we use the good old correlation function. Traditionally other (dependance) measures are used, namely support, confidence and lift. We will come to that later on in this post. Wikipedia offers the following fitting description of association rule learning: Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify rules discovered in databases using some measures of interestingness. For example, the rule found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as, e.g. promotional pricing or product placements. In addition to the above example from market basket analysis association rules are employed today in many application areas including Web usage mining, intrusion detection, continuous production, and bioinformatics. So, this is also the basis of popular functions on ecommerce sites (“customers who bought this item also bought…”) or movie streaming platforms (“customers who watched this film also watched…”). A very good package for real-world datasets is the arules package (on CRAN). Have a look at the following code: library(arules) ## Loading required package: Matrix ## ## Attaching package: 'arules' ## The following objects are masked from 'package:base': ## ## abbreviate, write data("Groceries") rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.5)) ## Apriori ## ## Parameter specification: ## confidence minval smax arem aval originalSupport maxtime support minlen ## 0.5 0.1 1 none FALSE TRUE 5 0.001 1 ## maxlen target ext ## 10 rules FALSE ## ## Algorithmic control: ## filter tree heap memopt load sort verbose ## 0.1 TRUE TRUE FALSE TRUE 2 TRUE ## ## Absolute minimum support count: 9 ## ## set item appearances ...[0 item(s)] done [0.00s]. ## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s]. ## sorting and recoding items ... [157 item(s)] done [0.00s]. ## creating transaction tree ... done [0.00s]. ## checking subsets of size 1 2 3 4 5 6 done [0.02s]. ## writing ... [5668 rule(s)] done [0.00s]. ## creating S4 object ... done [0.00s]. rules_conf <- arules::sort(rules, by = "confidence", decreasing = TRUE) inspect(head(rules_conf, 10)) ## lhs rhs support confidence lift count ## [1] {rice, ## sugar} => {whole milk} 0.001220132 1 3.913649 12 ## [2] {canned fish, ## hygiene articles} => {whole milk} 0.001118454 1 3.913649 11 ## [3] {root vegetables, ## butter, ## rice} => {whole milk} 0.001016777 1 3.913649 10 ## [4] {root vegetables, ## whipped/sour cream, ## flour} => {whole milk} 0.001728521 1 3.913649 17 ## [5] {butter, ## soft cheese, ## domestic eggs} => {whole milk} 0.001016777 1 3.913649 10 ## [6] {citrus fruit, ## root vegetables, ## soft cheese} => {other vegetables} 0.001016777 1 5.168156 10 ## [7] {pip fruit, ## butter, ## hygiene articles} => {whole milk} 0.001016777 1 3.913649 10 ## [8] {root vegetables, ## whipped/sour cream, ## hygiene articles} => {whole milk} 0.001016777 1 3.913649 10 ## [9] {pip fruit, ## root vegetables, ## hygiene articles} => {whole milk} 0.001016777 1 3.913649 10 ## [10] {cream cheese , ## domestic eggs, ## sugar} => {whole milk} 0.001118454 1 3.913649 1  The algorithm used here is the so called Apriori algorithm. It ameliorates the problem with real-world datasets that when you want to test all combinations of all possible items you very soon run into performance problems – even with very fast computers – because there are just too many possibilities to be tested. The Apriori algorithm aggressively prunes the possibilities to be tested by making use of the fact that if you are only interested in rules that are supported by a certain number of instances you can start with testing the support of individual items – which is easy to do – and work your way up to more complicated rules. The trick is that you don’t test more complicated rules with items which don’t have enough support on the individual level: this is because if you don’t have enough instances on the individual level you don’t even have to look at more complicated combinations with those items included (which would be even more scarce). What sounds like an obvious point brings about huge time savings for real-world datasets which couldn’t be analyzed without this trick. As mentioned above important concepts to assess the quality (or interestingness) of association rules are support, confidence and lift: • Support of : percentage of X for all cases • Confidence of : percentage of Y for all X • Lift of : ratio of the observed support of X and Y to what would be expected if X and Y were independent To understand these concepts better we are going to rebuild the examples given in the Wikipedia article in R. Have a look at the parts “Definition” and “Useful Concepts” of the article and after that at the following code, which should be self-explanatory: M <- matrix(c(1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0), ncol = 5, byrow = TRUE) colnames(M) <- c("milk", "bread", "butter", "beer", "diapers") M ## milk bread butter beer diapers ## [1,] 1 1 0 0 0 ## [2,] 0 0 1 0 0 ## [3,] 0 0 0 1 1 ## [4,] 1 1 1 0 0 ## [5,] 0 1 0 0 0 supp <- function(X) { sum(rowSums(M[ , X, drop = FALSE]) == length(X)) / nrow(M) # "rowSums == length" mimics logical AND for the selected columns } conf <- function(X, Y) { supp(c(X, Y)) / supp(X) # conf(X => Y) } lift <- function(X, Y) { supp(c(X, Y)) / (supp(X) * supp(Y)) # lift(X => Y) } supp(c("beer", "diapers")) # percentage of X for all cases ## [1] 0.2 conf(c("butter", "bread"), "milk") # percentage of Y for all X ## [1] 1 lift(c("milk", "bread"), "butter") # ratio of the observed support of X and Y to what would be expected if X and Y were independent ## [1] 1.25  You should conduct your own experiments by playing around with different item combinations so that you really understand the mechanics of those important concepts. If all of those analyses are being done for perfecting your customer experience or just outright manipulation to lure you into buying stuff you don’t really need is obviously a matter of perspective… ## To understand recursion you have to understand recursion… Sorting values is one of the bread and butter tasks in computer science: this post uses it as a use case to learn what recursion is all about. It starts with some nerd humour… and ends with some more, so read on! Have a look at the following code and try to understand it: bogosort <- function(x) { while(is.unsorted(x)) x <- sample(x) x } set.seed(123) (n <- sample(1:100, 10)) ## [1] 29 79 41 86 91 5 50 83 51 42 bogosort(n) ## [1] 5 29 41 42 50 51 79 83 86 91  Obviously bogosort did its job… but in a ludicrously inefficient way! It just shuffles all values over and over until they are sorted by chance! In the worst case it just shuffles forever (i.e. until the sun swallows the earth or your partner switches the computer off… whatever comes first). To create a more efficient sorting algorithm one could use the following strategy: 1. Draw the smallest number, put it into a vector and 2. Sort the remaining numbers by going back to 1. The algorithm (which is called selection sort, see below) effectively sorts by calling itself! This is the main idea behind recursion. To dive deeper into how recursion works watch the following video: What on Earth is Recursion? – Computerphile So, basically Recursion means defining a function in terms of itself! Additionally you need some stopping criterion or base case, otherwise you would create an infinite loop. The recursive version of the factorial function is: In the video this formula is written in C, here is the R version: fact <- function(n) { if (n == 1) 1 else n * fact(n-1) } fact(4) ## [1] 24 factorial(4) #inbuilt factorial function ## [1] 24  After we have seen how recursion actually works, let us implement the above selection sort algorithm in R. First watch this, admittedly unconventional, video: Select-sort with Gypsy folk dance. We now implement the main idea recursively: selsort <- function(x) { if(length(x) > 1) { mini <- which.min(x) c(x[mini], selsort(x[-mini])) } else x } set.seed(123) (v <- sample(1:100, 20)) ## [1] 29 79 41 86 91 5 50 83 51 42 87 98 60 94 9 77 21 4 27 78 selsort(v) ## [1] 4 5 9 21 27 29 41 42 50 51 60 77 78 79 83 86 87 91 94 98  Another, even more efficient sorting algorithm, quicksort, also works recursively. Watch the following video to get the idea: Quick-sort with Hungarian folk dance. So, the main idea behind the quicksort algorithm is to find a pivot for each unsorted subset of numbers which splits it into two similarly sized subsets. Then the sorting function calls itself for each subset. Have a look at the following code: qsort <- function(v) { if (length(v) > 1) { pivot <- (min(v) + max(v)) / 2 c(qsort(v[v < pivot]), v[v == pivot], qsort(v[v > pivot])) } else v } qsort(v) ## [1] 4 5 9 21 27 29 41 42 50 51 60 77 78 79 83 86 87 91 94 98  Now for some speed comparisons between selection sort, quicksort and R’s sort function (on my lame computer…): options(expressions = 7500) N <- 7000 set.seed(123) vs <- runif(N) system.time(u <- selsort(vs)) ## user system elapsed ## 0.59 0.14 0.73 system.time(u <- qsort(vs)) ## user system elapsed ## 0.01 0.00 0.01 system.time(u <- sort(vs)) ## user system elapsed ## 0 0 0  Wow, although quicksort was implemented in R (and not in C as the sort function) it is impressively fast. Why? Because each subset of unsorted numbers is again split into two subsets only about half as big. This pushes down the sizes of subsets that still have to be sorted down pretty fast. In the case of selection sort the subset only gets smaller by one number (the smallest one) in each step. Let us end this post with a little easter egg from google – do you get it? ## So, what is AI really? One of the topics that is totally hyped at the moment is obviously Artificial Intelligence or AI for short. There are many self-proclaimed experts running around trying to sell you the stuff they have been doing all along under this new label. When you ask them what AI means you will normally get some convoluted explanations (which is a good sign that they don’t get it themselves) and some “success stories”. The truth is that many of those talking heads don’t really know what they are talking about, yet happen to have a friend who knows somebody who picked up a book at the local station bookshop… ok, that was nasty but unfortunately often not too far away from the truth. So, what is AI really? This post tries to give some guidance. The traditional way coding a computer program worked was through carefully analyzing the problem, trying to separate its different parts into simpler sub-problems, put the solution into an algorithmic form (kind of a recipe) and finally code it. Let’s have a look at an example! Let’s say you want to program an app for mushroom pickers with warnings for certain qualities. The idea is to find an easy to follow guide in the form of “if your mushroom has this quality then DO NOT eat it!” (in computer lingo called “conditional statements” or just “conditionals”). As any mushroom picker can attest this is not an easy task. For the matter have a look at the following dataset (you can find it here: mushrooms, originally it is from https://archive.ics.uci.edu/ml/datasets/mushroom): mushrooms <- read.csv("data/mushrooms.csv") str(mushrooms) ## 'data.frame': 8124 obs. of 23 variables: ##$ cap_shape               : Factor w/ 6 levels "bell","conical",..: 3 3 1 3 3 3 1 1 3 1 ...
##  $cap_surface : Factor w/ 4 levels "fibrous","grooves",..: 4 4 4 3 4 3 4 3 3 4 ... ##$ cap_color               : Factor w/ 10 levels "brown","buff",..: 1 10 9 9 4 10 9 9 9 10 ...
##  $bruises : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ... ##$ odor                    : Factor w/ 9 levels "almond","anise",..: 8 1 2 8 7 1 1 2 8 1 ...
##  $gill_attachment : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ... ##$ gill_spacing            : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
##  $gill_size : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ... ##$ gill_color              : Factor w/ 12 levels "black","brown",..: 1 1 2 2 1 2 5 2 8 5 ...
##  $stalk_shape : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ... ##$ stalk_root              : Factor w/ 5 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
##  $stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ... ##$ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $stalk_color_above_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ... ##$ stalk_color_below_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $veil_type : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ... ##$ veil_color              : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ring_number : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ... ##$ ring_type               : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $spore_print_color : Factor w/ 9 levels "black","brown",..: 1 2 2 1 2 1 1 2 1 1 ... ##$ population              : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $habitat : Factor w/ 7 levels "grasses","leaves",..: 5 1 3 5 1 1 3 3 1 3 ... ##$ type                    : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...

##   cap_shape cap_surface cap_color bruises    odor gill_attachment
## 1    convex      smooth     brown     yes pungent            free
## 2    convex      smooth    yellow     yes  almond            free
## 3      bell      smooth     white     yes   anise            free
## 4    convex       scaly     white     yes pungent            free
## 5    convex      smooth      gray      no    none            free
## 6    convex       scaly    yellow     yes  almond            free
##   gill_spacing gill_size gill_color stalk_shape stalk_root
## 1        close    narrow      black   enlarging      equal
## 2        close     broad      black   enlarging       club
## 3        close     broad      brown   enlarging       club
## 4        close    narrow      brown   enlarging      equal
## 5      crowded     broad      black    tapering      equal
## 6        close     broad      brown   enlarging       club
##   stalk_surface_above_ring stalk_surface_below_ring stalk_color_above_ring
## 1                   smooth                   smooth                  white
## 2                   smooth                   smooth                  white
## 3                   smooth                   smooth                  white
## 4                   smooth                   smooth                  white
## 5                   smooth                   smooth                  white
## 6                   smooth                   smooth                  white
##   stalk_color_below_ring veil_type veil_color ring_number  ring_type
## 1                  white   partial      white         one    pendant
## 2                  white   partial      white         one    pendant
## 3                  white   partial      white         one    pendant
## 4                  white   partial      white         one    pendant
## 5                  white   partial      white         one evanescent
## 6                  white   partial      white         one    pendant
##   spore_print_color population habitat      type
## 1             black  scattered   urban poisonous
## 2             brown   numerous grasses    edible
## 3             brown   numerous meadows    edible
## 4             black  scattered   urban poisonous
## 5             brown   abundant grasses    edible
## 6             black   numerous grasses    edible


The dataset consists of 8124 examples of mushrooms with 22 qualities each (plus the attribute whether the respective mushroom is edible or poisonous). Well, obviously this is not going to be easy…

A naive approach would be to formulate rules for every instance: if the cap shape is convex and the cap surface smooth and the cap colour brown… and so on for all 22 attributes, then DO NOT eat it! This would obviously not be very helpful. Another approach would be to go through every attribute and see whether it is helpful in determining the type, so for example:

table(mushrooms$cap_shape, mushrooms$type)
##
##           edible poisonous
##   bell       404        48
##   conical      0         4
##   convex    1948      1708
##   flat      1596      1556
##   knobbed    228       600
##   sunken      32         0


Obviously this attribute isn’t very helpful, in many cases it just gives a “maybe, maybe not”-answer. Perhaps the approach itself is not so bad after all but you would have to try it for all 22 attributes, interpret the results, pick the best one, formulate if-then-rules and code them… tiresome and error-prone.

Wouldn’t it be nice to do it the other way around: just show the computer all of the examples and it magically programs itself by finding the appropriate if-then-rules automatically? This is what AI is all about:

Artificial Intelligence (AI): Showing a computer examples of a problem so that it programs itself to solve it.

So, let us throw AI at our problem in the form of the OneR package (on CRAN):

library(OneR)
OneR(mushrooms, verbose = TRUE)
##
##     Attribute                Accuracy
## 1 * odor                     98.52%
## 2   spore_print_color        86.8%
## 3   gill_color               80.5%
## 4   ring_type                77.55%
## 5   stalk_surface_above_ring 77.45%
## 6   stalk_surface_below_ring 76.61%
## 7   gill_size                75.63%
## 8   bruises                  74.4%
## 9   population               72.18%
## 10  stalk_color_above_ring   71.64%
## 11  stalk_color_below_ring   71.44%
## 12  habitat                  69.03%
## 13  stalk_root               64.6%
## 14  gill_spacing             61.6%
## 15  cap_color                59.53%
## 16  cap_surface              58.05%
## 17  cap_shape                56.43%
## 18  stalk_shape              55.29%
## 19  ring_number              53.82%
## 20  veil_color               51.9%
## 21  gill_attachment          51.8%
## 21  veil_type                51.8%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
##
## Call:
## OneR.data.frame(x = mushrooms, verbose = TRUE)
##
## Rules:
## If odor = almond   then type = edible
## If odor = anise    then type = edible
## If odor = creosote then type = poisonous
## If odor = fishy    then type = poisonous
## If odor = foul     then type = poisonous
## If odor = musty    then type = poisonous
## If odor = none     then type = edible
## If odor = pungent  then type = poisonous
## If odor = spicy    then type = poisonous
##
## Accuracy:
## 8004 of 8124 instances classified correctly (98.52%)


Wow! Within the blink of an eye and with just one command (OneR) we got all the rules we need for our app! It is the odour: if it smells poisonous it probably is. The accuracy of this is nearly 99%, not too bad for such a simple rule… we wouldn’t even need an app for that.

For more examples, some deeper explanation (and even a video) on the OneR package go here: OneR – Establishing a New Baseline for Machine Learning Classification Models.

By the way, in the words “programs itself” – impressive as it may be – is still the term “programming”, so we are not talking about machines developing intentions, feelings or even consciousness anytime soon. This is the domain of Hollywood and not AI!

As with every hype there are many terms flying around, like Machine Learning (ML), Data Science and (Predictive) Analytics and somewhat older terms, like Data Mining, Knowledge Discovery and Business Intelligence… and many, many more. Of course you can define them in all sorts of ways but to be honest with you in essence they all mean the same (see definition above). I would argue that Data Science is a somewhat broader term which comprises also e.g. the handling of data, interpretation by domain experts and presentation to the business but that is just one definition. There is an old joke that what is Machine Learning when put on powerpoint becomes Artificial Intelligence – it just sounds so much better than any technical term.

We can now answer another question: why now? Why is there a hype now? I will share a secret with you: most of the AI methods used today are quite old! For example the core principles of (artificial) neural networks (a.k.a. deep learning) are from the 60’s of the last century! Now, more than half a century later we’ve got the hype. The reason is:

It’s the data, stupid!

Because AI is all about learning from examples, we need lots and lots of data and because of the internet and mobile revolution we are now drowning in those data. Combine this with more and more powerful hardware (also in the form of cheap cloud services) and you’ve got a revolution at your hands.

And there is yet another lesson to be learned: when I said “we” are drowning in data it was not entirely correct: tech companies like Google, Amazon, Facebook, Apple (GAFA) and their Chinese counterparts, like Baidu, Alibaba, and Tencent (BAT) are drowning in our (!) data. This is the reason why they are also leading the field of AI and not because they necessarily have the brightest AI geniuses.

This point is best illustrated by the example of DeepL: a small German startup in the area of machine translation. Many tests conducted by professional translators came to the conclusion that the translations by DeepL are far better than any other machine translations (like Google Translate). Why? Not because there necessarily is a secret sauce in their algorithms but because the company behind it (Linguee) had been amassing hand-curated bilingual language pairs for many, many years. The quality of their data is just so much better than of all the other translation services out there. This is their main differentiator.

That is not to say that there haven’t been any major developments in the area of AI algorithms or computer hardware or that you don’t need bright people to set up and finetune those systems (quite to the contrary!) but many of the algorithms are open-source and computing power is relatively cheap via cloud services and, yes, there is a shortage of good people at the moment but there are still many highly qualified data scientists around: the real difference is in (the quality and amount of) the data!

One last thing: many people think that AI is more objective than humans, after all its maths and cool logic, right? Wrong! When you only learn from examples and the examples given are racist the learned rules will be racist too! A very good book on the subject is Weapons of Math Destruction by Cathy O’Neal.

Hope that this gave you some perspective on the ongoing hype… perhaps hype is not such a good word after all because when you look at the underlying reasons you can see that this mega trend is here to stay!

## Learning Data Science: Modelling Basics

Data Science is all about building good models, so let us start by building a very simple model: we want to predict monthly income from age (in a later post we will see that age is indeed a good predictor for income).

For illustrative purposes we just make up some numbers for age and income, load both into a data frame, plot it and calculate its correlation matrix:

age <- c(21, 46, 55, 35, 28)
income <- c(1850, 2500, 2560, 2230, 1800)
data <- data.frame(age, income)
plot(data, pch = 16)


cor(data) # correlation
##              age    income
## age    1.0000000 0.9464183
## income 0.9464183 1.0000000


Just by looking at the plot we can see that the points seem to be linearly dependent. The correlation matrix corroborates this with a correlation of nearly . To build a linear regression model we use the lm function (for linear model). To specify the model we use the standard R formula interface: on the left hand side comes the dependent variable, i.e. the variable that we want to predict (income) and on the right hand side comes the independent variable, i.e. the variable that we want to use to predict income, viz. age – in between both variables we put a tilde (~). After that we plot the model as a line and look at some model diagnostics:

# building linear regression model
LinReg <- lm(income ~ age) # assign model to variable
abline(LinReg, col = "green", lwd = 2) # add regression line


LinReg # coefficients
##
## Call:
## lm(formula = income ~ age)
##
## Coefficients:
## (Intercept)          age
##     1279.37        24.56

summary(LinReg) # model summary
##
## Call:
## lm(formula = income ~ age)
##
## Residuals:
##       1       2       3       4       5
##   54.92   90.98  -70.04   91.12 -166.98
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1279.367    188.510   6.787  0.00654 **
## age           24.558      4.838   5.076  0.01477 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 132.1 on 3 degrees of freedom
## Multiple R-squared:  0.8957, Adjusted R-squared:  0.8609
## F-statistic: 25.77 on 1 and 3 DF,  p-value: 0.01477


The general line equation is , where is the slope (in this case ) and the intercept (here ). The p-values in the last column show us that both parameters are significant, so with a high probability not due to randomness. If we now want to predict the income of a 20-year old we just have to calculate .

To do the same calculation in R we can conveniently use the predict function which does this calculation for us automatically. We can also use vectors as input and get several predictions at once. Important is that we use the independent variable with its name as a data frame, otherwise R will throw an error.

# predicting income with linear model
predict(LinReg, data.frame(age = 20))
##       1
## 1770.52
pred_LinReg <- predict(LinReg, data.frame(age = seq(from = 0, to = 80, by = 5)))
names(pred_LinReg) <- seq(0, 80, 5)
pred_LinReg
##        0        5       10       15       20       25       30       35
## 1279.367 1402.155 1524.944 1647.732 1770.520 1893.308 2016.097 2138.885
##       40       45       50       55       60       65       70       75
## 2261.673 2384.461 2507.249 2630.038 2752.826 2875.614 2998.402 3121.190
##       80
## 3243.979


So far so good, but wouldn’t it be nice to actually have a model that is even more accurate, a model where the line actually goes through all the data points so that all the available information is being used. Let us do that with a so called polynomial regression (in red):

# polynomial regression
PolyReg <- lm(income ~ poly(age, 4, raw = TRUE))
lines(c(20:55), predict(PolyReg, data.frame(age = c(20:55))), col = "red")


pred_PolyReg <- predict(PolyReg, data.frame(age = seq(0, 80, 5)))
names(pred_PolyReg) <- seq(0, 80, 5)
pred_PolyReg
##         0         5        10        15        20        25        30
## 19527.182 11103.509  5954.530  3164.488  1959.098  1705.541  1912.468
##        35        40        45        50        55        60        65
##  2230.000  2449.726  2504.705  2469.464  2560.000  3133.778  4689.732
##        70        75        80
##  7868.267 13451.255 22362.037


What do you think? Is this a better model? Well, there is no good reason why income should go up the younger or older you are, and there is no good reason why there should be a bump between and . In other words, this model takes the noise as legitimate input which actually makes it worse! To have a good model you always need some rigidity to generalize the data.

This illustration is a good example of one of the big problems of machine learning: overfitting the data! The other extreme would be underfitting, which is included in the following plot as a blue line:

# mean income as a model
abline(h = mean(income), col = "blue")

title(main = "All models are wrong, some are useful")
legend("bottomright", c("Mean: Underfitting", "Linear fit", "Polynomial Reg.: Overfitting"), col = c("blue", "green", "red"), lwd = 3)

mean(income)
## [1] 2188


We just took the mean value of income and claimed that no matter the age everybody would earn about . This also obviously doesn’t make the cut. In practice and in all data science projects you always have to look for a model that hits the sweet spot between over- and underfitting. This is as much an art as a science, although there are certain best practices like dividing the data into a training and a test set, cross validation etc.

Another important point it that all models have their area of applicability. In physics for example quantum physics works well for the very small but not very good for the very big, it is the other way round for relativity theory. In this case we didn’t have data for people of very young or very old age, so it is in general very dangerous to make prediction for those people, i.e. for values outside of the original observation range (also called extrapolation – in contrast to interpolation = predictions within the original range). In case of the linear model we got an income of about for newborns (the intercept)! And the income gets higher the older people get, also something that doesn’t make sense. This behaviour was a lot worse for the non-linear model.

So, here you can see that it is one thing to just let R run and spit something out and quite another one to find a good model and interpret it correctly – even for such a simple problem as this one. In a later post we will use real-world data to predict income brackets with OneR, decision trees and random forests, so stay tuned!

The post on now online: Learning Data Science: Predicting Income Brackets!

## Hash Me If You Can

We are living in the era of Big Data but the problem of course is that the bigger our data sets become the slower even simple search operations get. I will now show you a trick that is the next best thing to magic: building a search function that practically doesn’t slow down even for large data sets… in base R!

On first thought this is totally counterintuitive: the bigger the data set is, the longer it should take to search it, right? Wrong!

The data structure we will be talking about is called a hash or a dictionary (sometimes also called associated array). The big idea is to use a mathematical function (called hash function) which maps each data item (e.g. a name) to an address (called hash) where the corresponding value (e.g. a telephone number) is stored. So to find the telephone number for a certain name you don’t have to search through all the names but you just put it into the hash function and you get back the address for the telephone number instantaneously:

This image is from the very good wikipedia article on hash function algorithms: Hash function. It also gives a trivial example of a hash function to get the idea:

If the data to be hashed is small enough, one can use the data itself (reinterpreted as an integer) as the hashed value. The cost of computing this “trivial” (identity) hash function is effectively zero. This hash function is perfect, as it maps each input to a distinct hash value.

To do this with R we use environments with the option hash = TRUE. The following example is from an answer I gave on stackoverflow (https://stackoverflow.com/a/42350672/468305):

# vectorize assign, get and exists for convenience
assign_hash <- Vectorize(assign, vectorize.args = c("x", "value"))
get_hash <- Vectorize(get, vectorize.args = "x")
exists_hash <- Vectorize(exists, vectorize.args = "x")

# keys and values
key <- c("tic", "tac", "toe")
value <- c(1, 22, 333)

# initialize hash
hash <- new.env(hash = TRUE, parent = emptyenv(), size = 100L)

# assign values to keys
assign_hash(key, value, hash)
## tic tac toe
##   1  22 333

# get values for keys
get_hash(c("toe", "tic"), hash)
## toe tic
## 333   1

# alternatively:
mget(c("toe", "tic"), hash)
## $toe ## [1] 333 ## ##$tic
## [1] 1

# show all keys
ls(hash)
## [1] "tac" "tic" "toe"

# show all keys with values
get_hash(ls(hash), hash)
## tac tic toe
##  22   1 333

# remove key-value pairs
rm(list = c("toe", "tic"), envir = hash)
get_hash(ls(hash), hash)
## tac
##  22

# check if keys are in hash
exists_hash(c("tac", "nothere"), hash)
##     tac nothere
##    TRUE   FALSE

# for single keys this is also possible:
# show value for single key
hash[["tac"]]
## [1] 22

# create new key-value pair
hash[["test"]] <- 1234
get_hash(ls(hash), hash)
##  tac test
##   22 1234

# update single value
hash[["test"]] <- 54321
get_hash(ls(hash), hash)
##   tac  test
##    22 54321


So you see that using the inbuilt hash functionality is quite simple. To get an idea of the performance boost there is a very thorough article here: http://jeffreyhorner.tumblr.com/post/114524915928/hash-table-performance-in-r-part-i.

Have a look at the following plot from the article:

Horner writes:

Bam! See that blue line? That’s near constant time for searching the entire size hash table!

If I could whet your appetite I want to close with a pointer to a much more professional implementation of hash tables using environments: https://CRAN.R-project.org/package=hash

I haven’t tried the package myself so far but the author, Christopher Brown, promises:

The hash package is the only full featured hash implementation for the R language. It provides more features and finer control of the hash behavior than the native feature set and has similar and sometimes better performance.

## If wealth had anything to do with intelligence…

…the richest man on earth would have a fortune of no more than $43,000! If you don’t believe me read this post! Have you ever thought about the distribution of wealth as a function of some quality? Especially rich people pride themselves on extraordinary abilities, so that they somehow “deserve” their wealth. Now “abilities” is somewhat hard to measure so let us take “intelligence” as a proxy. Intelligence is distributed normally with mean = 100 and standard deviation = 15. An interesting question is what is the expected maximum intelligence of all persons alive today? The first thought could be to simulate this but as it soon turns out this approach doesn’t work because you had to create and analyze a vector with nearly 8 billion elements – R will balk at this. So we will have to look for an analytical solution – fortunately, asking my colleague Prof. Dr. Google brings up the following mathunderflow question and answers: Expected value for maximum of n normal random variable. According to this the expected maximum of independent is: where (lowercase phi) and (uppercase phi) stand for the pdf and cdf of the normal distribution respectively. I spare you the mathematical details of the derivation. Unfortunately there doesn’t seem to be a closed form for the integral so we will solve it numerically by making use of the integral function – first we reproduce the values of the accepted answer: f <- function(x, n) x * dnorm(x) * pnorm(x)^(n-1) for (i in 2:10) print(i * integrate(f, n = i, -Inf, Inf, abs.tol = 1e-20)$value) # test
## [1] 0.5641896
## [1] 0.8462844
## [1] 1.029375
## [1] 1.162964
## [1] 1.267206
## [1] 1.352178
## [1] 1.4236
## [1] 1.485013
## [1] 1.538753


This seems to be good enough. Now, let us calculate the maximum IQ that is to be expected for the current world population:

100 + 15 * world_population * integrate(f, n = world_population, -Inf, Inf, abs.tol = 1e-20)$value # expected max IQ of all people alive ## [1] 196.1035  To compare this with the actual record we find that Marilyn vos Savant seems to be (one of) the most intelligent persons on the planet. When you read the Wikipedia article you will find that there is some debate about what the actual number really is but it seems to be around 200… again, the above approximation seems to be good enough. And now for the final leg: if the wealth distribution followed the same rules as the IQ distribution we could use the same formula to calculate the expected maximum wealth: global_wealth <- 1.68e+14 # source: Allianz Global Wealth Report 2018 world_population <- 7.7e+09 per_capita_wealth <- global_wealth / world_population per_capita_wealth_sd <- per_capita_wealth * 0.15 # equivalent to IQ per_capita_wealth ## [1] 21818.18 per_capita_wealth + per_capita_wealth_sd * world_population * integrate(f, n = world_population, -Inf, Inf, abs.tol = 1e-20)$value
## [1] 42786.21


There you go: not more than $43.000! But the richest man at the moment, Jeff Bezos, has a net worth of about 140 billion US$!!! Conversely that would translate to an IQ of

1.4e+11 / per_capita_wealth * 100
## [1] 641666667


So, compared to an IQ of over 641 million Einstein would be the mental equivalent of an amoeba! If IQ were additive, i.e. if two persons with an IQ of 100 each had an combined IQ of 200, you had to take nearly the whole population of Europe (about 738 million) to match that; even the whole North American continent wouldn’t suffice with its 579 million inhabitants!

But if it is not ability/intelligence that determines the distribution of wealth what else could account for the extreme inequality we perceive in the world?

Stay tuned for more to come…

## Understanding the Magic of Neural Networks

Everything “neural” is (again) the latest craze in machine learning and artificial intelligence. Now what is the magic here?

Let us dive directly into a (supposedly little silly) example: we have three protagonists in the fairy tail little red riding hood, the wolf, the grandmother and the woodcutter. They all have certain qualities and little red riding hood reacts in certain ways towards them. For example the grandmother has big eyes, is kindly and wrinkled – little red riding hood will approach her, kiss her on the cheek and offer her food (the behavior “flirt with” towards the woodcutter is a little sexist but we kept it to reproduce the original example from Jones, W. & Hoskins, J.: Back-Propagation, Byte, 1987). We will build and train a neural network which gets the qualities as inputs and little red riding wood’s behaviour as output, i.e. we train it to learn the adequate behaviour for each quality.

Have a look at the following code and its output including the resulting plot:

library(neuralnet)
library(NeuralNetTools)

# code qualities and actions
qualities <- matrix (c(1, 1, 1, 0, 0, 0,
0, 1, 0, 1, 1, 0,
1, 0, 0, 1, 0, 1), byrow = TRUE, nrow = 3)
colnames(qualities) <- c("big_ears", "big_eyes", "big_teeth", "kindly", "wrinkled", "handsome")
rownames(qualities) <- c("wolf", "grannie", "woodcutter")
qualities
##            big_ears big_eyes big_teeth kindly wrinkled handsome
## wolf              1        1         1      0        0        0
## grannie           0        1         0      1        1        0
## woodcutter        1        0         0      1        0        1

actions <- matrix (c(1, 1, 1, 0, 0, 0, 0,
0, 0, 0, 1, 1, 1, 0,
0, 0, 0, 1, 0, 1, 1), byrow = TRUE, nrow = 3)
colnames(actions) <- c("run_away", "scream", "look_for_woodcutter", "kiss_on_cheek", "approach", "offer_food", "flirt_with")
rownames(actions) <- rownames(qualities)
actions
##            run_away scream look_for_woodcutter kiss_on_cheek approach offer_food flirt_with
## wolf              1      1                   1             0        0          0          0
## grannie           0      0                   0             1        1          1          0
## woodcutter        0      0                   0             1        0          1          1

data <- cbind(qualities, actions)

# train the neural network (NN)
set.seed(123) # for reproducibility
neuralnetwork <- neuralnet(run_away + scream+look_for_woodcutter + kiss_on_cheek + approach +
offer_food + flirt_with ~
big_ears + big_eyes + big_teeth + kindly + wrinkled + handsome,
data = data, hidden = 3, exclude = c(1, 8, 15, 22, 26, 30, 34, 38, 42, 46),
lifesign = "minimal", linear.output = FALSE)
## hidden: 3    thresh: 0.01    rep: 1/1    steps:      48  error: 0.01319  time: 0.01 secs

# plot the NN
par_bkp <- par(mar = c(0, 0, 0, 0)) # set different margin to minimize cutoff text
plotnet(neuralnetwork, bias = FALSE)


par(par_bkp)

# predict actions
round(compute(neuralnetwork, qualities)net.result) ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] ## wolf 1 1 1 0 0 0 0 ## grannie 0 0 0 1 1 1 0 ## woodcutter 0 0 0 1 0 1 1  First the qualities and the actions are coded as binary variables in a data frame. After that the neural network is being trained with the qualities as input and the resulting behaviour as output (using the standard formula syntax). In the neuralnet function a few additional technical arguments are set which details won’t concern us here, they just simplify the process in this context). Then we plot the learned net and test it by providing it with the respective qualities: in all three cases it predicts the right actions. How did it learn those? Let us look at the plot of the net. We see that there are two basic building blocks: neurons and weighted connections between them. We have one neuron for each quality and one neuron for each action. Between both layers we have a so called hidden layer with three neurons in this case. The learned strength between the neurons is shown by the thickness of the lines (whereby ‘black’ means positive and ‘grey’ negative weights). Please have a thorough look at those weights. You might have noticed that although the net didn’t know anything about the three protagonists in our little story it nevertheless correctly built a representation of them: ‘H1’ (for Hidden 1) represents the wolf because its differentiating quality is ‘big teeth’ which leads to ‘run away’, ‘scream’ and ‘look for woodcutter’, by the same logic ‘H2’ is the woodcutter and ‘H3’ is the grandmother. So the net literally learned to connect the qualities with respective actions of little red riding hood by creating a representation of the three protagonists! So an artificial neural network is obviously a network of neurons… so let us have a look at those neurons! Basically they are mathematical abstractions of real neurons in your brain. They consist of inputs and an output. The biologically inspired idea is that when the activation of the inputs surpasses a certain threshold the neuron fires. To be able to learn the neuron must, before summing up the inputs, adjust the inputs so that the output is not just arbitrary but matches some sensible result. What is ‘sensible’ you might ask. In a biological environment the answer is not always so clear cut but in our simple example here the neuron has just to match the output we provide it with (= supervised learning). The following abstraction has all we need, inputs, weights, the sum function, a threshold after that and finally the output of the neuron: Let us talk a little bit about what is going on here intuitively. First every input is taken, multiplied by its weight and all of this is summed up. Some of you might recognize this mathematical operation as a scalar product (also called dot product). Another mathematical definition of a scalar product is the following: That is we multiply the length of two vectors by the cosine of the angle of those two vectors. What has cosine to do with it? The cosine of an angle becomes one when both vectors point into the same direction, it becomes zero when they are orthogonal and minus one when both point into opposite directions. Does this make sense? Well, I give you a litte (albeit crude) parable. When growing up there are basically three stages: first you are totally dependent on your parents, then comes puberty and you are against whatever they say or think and after some years you are truly independent (some never reach that stage…). What does “independent” mean here? It means that you agree with some of the things your parents say and think and you disagree with some other things. During puberty you are as dependent on your parents as during being a toddler – you just don’t realize that but in reality you, so to speak, only multiply everything your parents say or think times minus one! What is the connection with cosine? Well, as a toddler both you and your parents tend to be aligned which gives one, during puberty both of you are aligned but in opposing directions which gives minus one and only as a grown up you are both independent which mathematically means that your vector in a way points in both directions at the same time which is only possible when it is orthogonal on the vector of your parents (you entered a new dimension, literally) – and that gives zero for the cosine. So cosine is nothing but a measure of dependence – as is correlation by the way. So this setup ensures that the neuron learns the dependence (or correlation) structure between the inputs and the output! The step function is just a way to help it to decide on which side of the fence it wants to sit, to make the decision clearer whether to fire or not. To sum it up, an artificial neuron is a non-linear function (in this case a step function) on a scalar product of the inputs (fixed) and the weights (adaptable to be able to learn). By adapting the weights the neuron learns the dependence structure between inputs and output. In R you code this idea of an artificial neuron as follows: neuron <- function(input) ifelse(weights %*% input > 0, 1, 0)  Now let us use this idea in R by training an artificial neuron to classify points in a plane. Have a look at the following table: Input 1 Input 2 Output 1 0 0 0 0 1 1 1 0 0 1 1 If you plot those points with the colour coded pattern you get the following picture: The task for the neuron is to find a separating line and thereby classify the two groups. Have a look at the following code: # inspired by Kubat: An Introduction to Machine Learning, p. 72 plot_line <- function(w, col = "blue", add = TRUE) curve(-w[1] / w[2] * x - w[3] / w[2], xlim = c(-0.5, 1.5), ylim = c(-0.5, 1.5), col = col, lwd = 3, xlab = "Input 1", ylab = "Input 2", add = add) neuron <- function(input) as.vector(ifelse(input %*% weights > 0, 1, 0)) # step function on scalar product of weights and input eta <- 0.5 # learning rate # examples input <- matrix(c(1, 0, 0, 0, 1, 1, 0, 1), ncol = 2, byrow = TRUE) input <- cbind(input, 1) # bias for intercept of line output <- c(0, 1, 0, 1) weights <- c(0.25, 0.2, 0.35) # random initial weights plot_line(weights, add = FALSE); grid() points(input[ , 1:2], pch = 16, col = (output + 2)) # training of weights of neuron for (example in 1:length(output)) { weights <- weights + eta * (output[example] - neuron(input[example, ])) * input[example, ] plot_line(weights) } plot_line(weights, col = "black")  # test: applying neuron on input apply(input, 1, neuron) ## [1] 0 1 0 1  As you can see the result matches the desired output, graphically the black line is the end result and as you can see it separates the green from the red points: the neuron has learned this simple classification task. The blue lines are where the neuron starts from and where it is during training – they are not able to classify the points correctly. The training, i.e. adapting the weights, takes places in this line: weights <- weights + eta * (output[example] - neuron(input[example, ])) * input[example, ]  The idea is to compare the current output of the neuron with the wanted output, scale that by some learning factor (eta) and modify the weights accordingly. So if the output is too big make the weights smaller and vice versa. Do this for all examples (sometimes you need another loop to train the neuron with the examples several times) and that’s it. That is the core idea behind the ongoing revolution of neural networks! Ok, so far we had a closer look at one part of neural networks, namely the neurons, let us now turn to the network structure (also called network topology). First, why do we need a whole network anyway when the neurons are already able to solve classification tasks? The answer is that they can do that only for very simple problems. For example the neuron above can only distinguish between linearly separable points, i.e. it can only draw lines. It fails in case of the simple problem of four points that are coloured green, red, red, green from top left to bottom right (try it yourself). We would need a non-linear function to separate the points. We have to combine several neurons to solve more complicated problems. The biggest problem you have to overcome when you combine several neurons is how to adapt all the weights. You need a system how to attribute the error at the output layer to all the weights in the net. This had been a profound obstacle until an algorithm called backpropagation (also abbreviated backprop) was invented (or found). We won’t get into the details here but the general idea is to work backwards from the output layers through all of the hidden layers till one reaches the input layer and modify the weights according to their respective contribution to the resulting error. This is done several (sometimes millions of times) for all training examples until one achieves an acceptable error rate for the training data. The result is that you get several layers of abstraction, so when you e.g. want to train a neural network to recognize certain faces you start with the raw input data in the form of pixels, these are automatically combined into abstract geometrical structures, after that the net detects certain elements of faces, like eyes and noses, and finally abstractions of certain faces are being rebuilt by the net. See the following picture (from nivdul.wordpress.com) for an illustration: So far we have only coded very small examples of a neural networks. Real-world examples often have dozens of layers with thousands of neurons so that much more complicated patterns can be learned. The more layers there are the ‘deeper’ a net becomes… which is the reason why the current revolution in this field is called “deep learning” because there are so many hidden layers involved. Let us now look at a more realistic example: predicting whether a breast cell is malignant or benign. Have a look at the following code: library(OneR) data(breastcancer) data <- breastcancer colnames(data) <- make.names(colnames(data)) dataClass <- as.integer(as.numeric(data\$Class) - 1) # for compatibility with neuralnet
data <- na.omit(data)

# Divide training (80%) and test set (20%)
set.seed(12) # for reproducibility
random <- sample(1:nrow(data), 0.8 * nrow(data))
data_train <- data[random, ]
data_test <- data[-random, ]

# Train NN on training set
model_train <- neuralnet(Class ~., data = data_train, hidden = c(9, 9), lifesign = "minimal")
## hidden: 9, 9    thresh: 0.01    rep: 1/1    steps:    3784   error: 0.00524  time: 3.13 secs

# Plot net
plot(model_train, rep = "best")


# Use trained model to predict test set
prediction <- round(predict(model_train, data_test))
eval_model(prediction, data_test)
##
## Confusion matrix (absolute):
##           Actual
## Prediction   0   1 Sum
##        0    93   2  95
##        1     4  38  42
##        Sum  97  40 137
##
## Confusion matrix (relative):
##           Actual
## Prediction    0    1  Sum
##        0   0.68 0.01 0.69
##        1   0.03 0.28 0.31
##        Sum 0.71 0.29 1.00
##
## Accuracy:
## 0.9562 (131/137)
##
## Error rate:
## 0.0438 (6/137)
##
## Error rate reduction (vs. base rate):
## 0.85 (p-value = 1.298e-13)


So you see that a relatively simple net achieves an accuracy of about 95% out of sample. The code itself should be mostly self-explanatory. For the actual training the neuralnet function from the package with the same name is being used, the input method is the standard R formula interface, where you define Class as the variable to be predicted by using all the other variables (coded as .~).

When you look at the net one thing might strike you as odd: there are three neurons at the top with a fixed value of 1. These are so called bias neurons and they serve a similar purpose as the intercept in a linear regression: they kind of shift the model as a whole in n-dimensional feature space just as a regression line is being shifted by the intercept. In case you were attentive we also smuggled in a bias neuron in the above example of a single neuron: it is the last column of the input matrix which contains only ones.

Another thing: as can even be seen in this simple example it is very hard to find out what a neural network has actually learned – the following well-known anecdote (urban legend?) shall serve as a warning: some time ago the military built a system which had the aim to distinguish military vehicles from civilian ones. They chose a neural network approach and trained the system with pictures of tanks, humvees and missile launchers on the one hand and normal cars, pickups and lorries on the other. After having reached a satisfactory accuracy they brought the system into the field (quite literally). It failed completely, performing no better than a coin toss. What had happened? No one knew, so they re-engineered the black box (no small feat in itself) and found that most of the military pics where taken at dusk or dawn and most civilian pics under brighter weather conditions. The neural net had learned the difference between light and dark!

Just for comparison the same example with the OneR package:

data(breastcancer)
data <- breastcancer

# Divide training (80%) and test set (20%)
set.seed(12) # for reproducibility
random <- sample(1:nrow(data), 0.8 * nrow(data))
data_train <- optbin(data[random, ], method = "infogain")
## Warning in optbin.data.frame(data[random, ], method = "infogain"): 12
## instance(s) removed due to missing values
data_test <- data[-random, ]

# Train OneR model on training set
model_train <- OneR(data_train, verbose = TRUE)
##
##     Attribute                   Accuracy
## 1 * Uniformity of Cell Size     92.32%
## 2   Uniformity of Cell Shape    91.59%
## 3   Bare Nuclei                 90.68%
## 4   Bland Chromatin             90.31%
## 5   Normal Nucleoli             90.13%
## 6   Single Epithelial Cell Size 89.4%
## 8   Clump Thickness             84.28%
## 9   Mitoses                     78.24%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

# Show model and diagnostics
summary(model_train)
##
## Call:
## OneR.data.frame(x = data_train, verbose = TRUE)
##
## Rules:
## If Uniformity of Cell Size = (0.991,2] then Class = benign
## If Uniformity of Cell Size = (2,10]    then Class = malignant
##
## Accuracy:
## 505 of 547 instances classified correctly (92.32%)
##
## Contingency table:
##            Uniformity of Cell Size
## Class       (0.991,2] (2,10] Sum
##   benign        * 318     30 348
##   malignant        12  * 187 199
##   Sum             330    217 547
## ---
## Maximum in each column: '*'
##
## Pearson's Chi-squared test:
## X-squared = 381.78243, df = 1, p-value < 0.00000000000000022204

# Plot model diagnostics
plot(model_train)


# Use trained model to predict test set
prediction <- predict(model_train, data_test)

# Evaluate model performance on test set
eval_model(prediction, data_test)
##
## Confusion matrix (absolute):
##            Actual
## Prediction  benign malignant Sum
##   benign        92         0  92
##   malignant      8        40  48
##   Sum          100        40 140
##
## Confusion matrix (relative):
##            Actual
## Prediction  benign malignant  Sum
##   benign      0.66      0.00 0.66
##   malignant   0.06      0.29 0.34
##   Sum         0.71      0.29 1.00
##
## Accuracy:
## 0.9429 (132/140)
##
## Error rate:
## 0.0571 (8/140)
##
## Error rate reduction (vs. base rate):
## 0.8 (p-value = 0.000000000007992571)


As you can see the accuracy is only slightly worse but you have full interpretability of the model… and you would only need to measure one value (“Uniformity of Cell Size”) instead of 9 to get a prediction!

On the other hand making neural networks interpretable is one of the big research challenges at the moment.

To end this rather long post: there is a real revolution going on at the moment with all kinds of powerful neural networks. Especially promising is a combination of reinforcement learning (the topic of an upcoming post) and neural networks, where the reinforcement learning algorithm uses a neural network as its memory. For example the revolutionary AlphaGo Zero is built this way: it just received the rules of Go, one of the most demanding strategy games humanity has ever invented, and grew superhuman strength after just three days! The highest human rank in Go has an ELO value of 2940 – AlphaGo Zero achieves 5185! Even the best players don’t stand a chance against this monster of a machine. The neural network technology that is used for AlphaGo Zero and many other deep neural networks is called Tensorflow, which can also easily be integrated into the R environment. To find out more go here: https://tensorflow.rstudio.com/

In this whole area there are many mind-blowing projects underway, so stay tuned!